Design a video conferencing platform like Zoom that supports real-time video/audio meetings with up to 1,000 participants, screen sharing, recording with transcription, breakout rooms, and webinar mode for 10,000+ attendees. The core architecture centres on SFU (Selective Forwarding Unit) servers for efficient media routing with simulcast/SVC for adaptive quality, WebRTC signalling, and geo-distributed infrastructure for global low-latency meetings.
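As a rough illustration of how an SFU can use simulcast for adaptive quality, the sketch below picks, per receiver, the highest simulcast layer that fits that receiver's estimated downlink. The layer ladder (`LAYERS`), the `headroom` factor, and the function name `select_layer` are assumptions made for this example and are not part of the original design; real bitrates depend on the codec and encoder settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulcastLayer:
    name: str
    height: int        # vertical resolution in pixels
    bitrate_kbps: int  # approximate encoded bitrate


# Hypothetical three-layer ladder, ordered from lowest to highest quality.
LAYERS = [
    SimulcastLayer("low", 180, 150),
    SimulcastLayer("medium", 360, 500),
    SimulcastLayer("high", 720, 1500),
]


def select_layer(estimated_downlink_kbps: float, headroom: float = 0.8) -> SimulcastLayer:
    """Return the best layer whose bitrate fits within the receiver's budget.

    Only a fraction (headroom) of the estimated bandwidth is spent on video,
    leaving room for audio and for estimation error. If nothing fits, the
    lowest layer is forwarded anyway.
    """
    budget = estimated_downlink_kbps * headroom
    chosen = LAYERS[0]
    for layer in LAYERS:
        if layer.bitrate_kbps <= budget:
            chosen = layer
    return chosen


if __name__ == "__main__":
    for kbps in (100, 1000, 3000):
        print(kbps, "->", select_layer(kbps).name)
```

Because the sender uploads every layer once and the SFU only forwards packets, switching a receiver between layers needs no transcoding on the server, which is the main reason this design scales better than a mixing (MCU) approach.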
| Metric | Value |
|---|---|
| Daily meeting participants | 300 million |
| Concurrent meetings (peak) | 5 million |
| Concurrent participants (peak) | 30 million |
| SFU servers globally | 10,000+ |
| Data centres | 20+ |
| Max participants per meeting | 1,000 (meeting), 10,000 (webinar) |
| Glass-to-glass latency target | < 150ms |
| Average meeting duration | 45 minutes |
| Recordings per day | 10 million |
| Storage for recordings per day | 50 PB |
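The figures in the table can be cross-checked with some back-of-envelope arithmetic. The sketch below derives a few per-unit numbers from them. It assumes decimal units (1 PB = 1,000,000 GB), treats "10,000+" SFU servers as exactly 10,000 (so the per-server load is an upper bound), and assumes every recording lasts the average meeting duration; these are simplifications for illustration only.

```python
# Inputs copied from the metrics table.
DAILY_PARTICIPANTS = 300_000_000
CONCURRENT_PARTICIPANTS = 30_000_000
CONCURRENT_MEETINGS = 5_000_000
SFU_SERVERS = 10_000                 # table says "10,000+", so this is a floor
RECORDINGS_PER_DAY = 10_000_000
RECORDING_STORAGE_PB_PER_DAY = 50
AVG_MEETING_MINUTES = 45

# Average meeting size at peak.
avg_participants_per_meeting = CONCURRENT_PARTICIPANTS / CONCURRENT_MEETINGS

# Peak load per SFU server (upper bound, since there may be more servers).
participants_per_sfu = CONCURRENT_PARTICIPANTS / SFU_SERVERS

# Average concurrency if usage were spread evenly over 24 hours.
avg_concurrent = DAILY_PARTICIPANTS * AVG_MEETING_MINUTES / (24 * 60)
peak_to_average = CONCURRENT_PARTICIPANTS / avg_concurrent

# Storage per recording, in GB (decimal units).
gb_per_recording = RECORDING_STORAGE_PB_PER_DAY * 1_000_000 / RECORDINGS_PER_DAY

# Implied average bitrate of a stored recording, in Mbit/s.
recording_mbps = gb_per_recording * 8_000 / (AVG_MEETING_MINUTES * 60)

if __name__ == "__main__":
    print(f"participants per meeting: {avg_participants_per_meeting:.0f}")
    print(f"participants per SFU:     {participants_per_sfu:.0f}")
    print(f"peak / average load:      {peak_to_average:.1f}x")
    print(f"GB per recording:         {gb_per_recording:.0f}")
    print(f"recording bitrate:        {recording_mbps:.1f} Mbit/s")
```

Under these assumptions the table implies about 6 participants per meeting, up to 3,000 participants per SFU server at peak, and roughly 5 GB (about 14.8 Mbit/s) per 45-minute recording, which is worth sanity-checking against the intended recording format.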
Video conferencing: host real-time video/audio meetings with up to 1,000 participants (100 on video, rest audio-only); < 150ms glass-to-glass latency; adaptive quality (resolution/bitrate) based on each participant's network conditions
Meeting management: create/schedule meetings (one-time or recurring) with a meeting link; join via link, meeting ID + passcode, or calendar invite; host controls: mute/unmute participants, remove, enable waiting room, lock meeting
Screen sharing: any participant can share their screen (full screen, application window, or browser tab) as a video stream; viewers see the shared screen alongside speaker video; support annotation/drawing during share
In-meeting chat: real-time text chat during meetings; send to all or privately to specific participants; share files/links in chat; chat history available during the meeting
Recording: host can record the meeting (video + audio + screen share + chat); the recording is saved to cloud storage; after the meeting, it is available for download or streaming, with a transcript generated from the recorded audio
Gallery view and active speaker: display all participants in a grid (gallery view, up to 49 per page) or spotlight the active speaker (largest tile); switch between layouts; dynamic tile resizing based on participant count
Breakout rooms: host can split the meeting into smaller sub-groups; participants auto-assigned or manually assigned to rooms; host can broadcast messages to all rooms; rooms merge back to the main meeting when closed
Virtual backgrounds and noise cancellation: client-side ML to replace/blur the user's background in real-time; AI-powered noise suppression removes background noise (keyboard, dogs, construction) from audio
Webinar mode: one-to-many broadcast for up to 10,000 view-only attendees with a panel of speakers; attendees can raise a hand, submit questions via Q&A, and use reactions; speakers' streams are forwarded to attendees via CDN
End-to-end encryption (E2EE): optional E2EE mode where meeting media is encrypted on each participant's device and decrypted only by other participants; the platform's servers cannot access meeting content
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements. These will guide your deep dives later.
Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
Frame NFRs for this specific system. 'Low latency search under 100ms' is far more valuable than just 'low latency'.
Add concrete numbers: 'P99 response time < 500ms', '99.9% availability', '10M DAU'. This drives architectural decisions.
Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?