Design Zoom

Design a video conferencing platform like Zoom that supports real-time video/audio meetings with up to 1,000 participants, screen sharing, recording with transcription, breakout rooms, and webinar mode for 10,000+ attendees. The core architecture centres on SFU (Selective Forwarding Unit) servers for efficient media routing with simulcast/SVC for adaptive quality, WebRTC signalling, and geo-distributed infrastructure for global low-latency meetings.

Scale Estimates

Metric	Value
Daily meeting participants	300 million
Concurrent meetings (peak)	5 million
Concurrent participants (peak)	30 million
SFU servers globally	10,000+
Data centres	20+
Max participants per meeting	1,000 (video), 10,000 (webinar)
Glass-to-glass latency target	< 150ms
Average meeting duration	45 minutes
Recordings per day	10 million
Storage for recordings per day	50 PB
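
The table's figures can be cross-checked with a little arithmetic. The sketch below derives a few per-unit numbers from them. It assumes decimal petabytes (1 PB = 10^15 bytes), treats the "10,000+" SFU count as exactly 10,000 (so the per-SFU figure is an upper bound), and assumes every recording lasts the 45-minute average; none of these assumptions is stated in the table, and the function name is purely illustrative.

```python
# Figures copied from the Scale Estimates table.
CONCURRENT_MEETINGS = 5_000_000
CONCURRENT_PARTICIPANTS = 30_000_000
SFU_SERVERS = 10_000                      # table says "10,000+": a lower bound
RECORDINGS_PER_DAY = 10_000_000
RECORDING_BYTES_PER_DAY = 50 * 10**15     # 50 PB, assuming decimal petabytes
AVG_MEETING_MINUTES = 45


def derived_estimates() -> dict[str, float]:
    """Derive per-meeting, per-SFU and per-recording figures from the table."""
    avg_recording_bytes = RECORDING_BYTES_PER_DAY / RECORDINGS_PER_DAY
    # Assumption: each recording runs for the average meeting duration.
    recording_seconds = AVG_MEETING_MINUTES * 60
    return {
        "participants_per_meeting": CONCURRENT_PARTICIPANTS / CONCURRENT_MEETINGS,
        "participants_per_sfu": CONCURRENT_PARTICIPANTS / SFU_SERVERS,
        "avg_recording_gb": avg_recording_bytes / 10**9,
        "implied_recording_mbps": avg_recording_bytes * 8 / recording_seconds / 10**6,
    }


if __name__ == "__main__":
    for name, value in derived_estimates().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the average meeting has 6 participants, each SFU carries at most about 3,000 participants at peak, and the average recording is 5 GB, which implies roughly 15 Mbps sustained over 45 minutes. That last figure is worth sanity-checking aloud in an interview, since it depends entirely on the stated storage estimate.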

Non-Functional Requirements

Low latency: < 150ms glass-to-glass (camera capture → display on remote screen); achieved via UDP/RTP, SFU forwarding (no encode/decode), regional SFU assignment, simulcast/SVC
Adaptive quality: Per-participant quality adaptation; bandwidth estimation (GCC/SCReAM); simulcast layer selection by SFU; graceful degradation (720p → 360p → 180p → audio-only) based on network
Scale: Millions of concurrent meetings; SFU fleet auto-scaled; cascading SFUs for large meetings across regions; stateless signalling servers horizontally scaled
Reliability: SFU failover (reconnect to backup SFU within 2s); no single point of failure; signalling state in Redis; meeting survives individual server failures
Security: DTLS-SRTP encryption by default; optional E2EE (per-meeting key, SFU is blind relay); meeting passcode + waiting room; host controls; compliance (SOC 2, HIPAA, GDPR)
Recording: Cloud recording (composite MP4); STT transcription + speaker diarisation; live captioning with < 500ms latency; access-controlled recording storage
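
The graceful-degradation ladder in the adaptive-quality requirement (720p → 360p → 180p → audio-only) can be sketched as a per-subscriber layer choice made by the SFU. This is a minimal illustration only: the bitrate thresholds, the 0.8 headroom factor, and the names `Layer`, `LADDER` and `select_layer` are hypothetical, and the bandwidth estimate is assumed to come from a congestion controller such as GCC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    min_kbps: int  # hypothetical minimum bandwidth needed to forward this layer


# Ordered best-first; the ladder comes from the text, the thresholds are assumed.
LADDER = [Layer("720p", 1500), Layer("360p", 500), Layer("180p", 150)]
AUDIO_ONLY = "audio-only"


def select_layer(estimated_kbps: float, headroom: float = 0.8) -> str:
    """Pick the highest simulcast layer that fits the subscriber's estimate.

    Only a fraction (headroom) of the estimate is spent on video so that
    audio and retransmissions still fit; the fraction is an assumption.
    """
    budget = estimated_kbps * headroom
    for layer in LADDER:
        if budget >= layer.min_kbps:
            return layer.name
    return AUDIO_ONLY  # no video layer fits: drop video, keep audio


if __name__ == "__main__":
    for kbps in (3000, 1000, 300, 100):
        print(kbps, "kbps ->", select_layer(kbps))
```

Because the SFU only chooses which already-encoded layer to forward, this decision involves no transcoding, which is consistent with the latency requirement above. A production selector would likely add hysteresis so that a subscriber hovering near a threshold does not flap between layers.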

Approach Guide

Non-Functional Requirements (~3 min)
Core Entities (~2 min)
API Design (~3 min)
High-Level Design (~5 min)

Follow-up Deep Dives (questions an interviewer might ask)

1. How does the real-time media architecture work for video conferencing?
2. How does signalling work to establish and manage a video meeting?
3. How would you handle large meetings (100–1,000 participants)?
4. How would you handle network adaptation and quality control?
5. How would you design the recording and transcription system?
6. How would you scale the infrastructure globally for millions of concurrent meetings?
7. How would you handle security, privacy, and meeting access control?