Design a real-time chat application similar to WhatsApp, Facebook Messenger, or Slack. The system supports 1-on-1 direct messages, group chats (up to 500 members), online/offline presence, message history, delivery receipts, and multimedia sharing.
| Metric | Value |
|---|---|
| Daily active users (DAU) | 100 million |
| Messages sent per day | 10 billion (~116,000 per second on average) |
| Average message size | 200 bytes (text); 1 MB (media) |
| Concurrent WebSocket connections | 50 million (50% of DAU online at peak) |
| Chat servers needed | ~5,000 (10K connections per server) |
| Redis entries (connection registry) | 50 million (one per online user) |
| Message storage per day | 10 billion × 200 bytes ≈ 2 TB (text only) |
| Message storage per year | 2 TB × 365 ≈ 730 TB (text only) |
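
The figures in the table follow from a few lines of arithmetic. The sketch below reproduces them as a sanity check; the function and parameter names are illustrative, and the inputs are the table's own assumptions (50% of DAU online at peak, 10K connections per server, 200-byte text messages).

```python
import math

SECONDS_PER_DAY = 86_400


def estimate_capacity(
    dau: int = 100_000_000,
    messages_per_day: int = 10_000_000_000,
    avg_text_bytes: int = 200,
    online_fraction: float = 0.5,          # assumption from the table: 50% of DAU online at peak
    connections_per_server: int = 10_000,  # assumption from the table
) -> dict:
    """Back-of-the-envelope capacity numbers for the chat system."""
    concurrent = int(dau * online_fraction)
    tb_per_day = messages_per_day * avg_text_bytes / 1e12  # decimal terabytes
    return {
        "messages_per_second": messages_per_day / SECONDS_PER_DAY,
        "concurrent_connections": concurrent,
        "chat_servers": math.ceil(concurrent / connections_per_server),
        "text_storage_tb_per_day": tb_per_day,
        "text_storage_tb_per_year": tb_per_day * 365,
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value:,.0f}")
```

The per-second rate is a daily average; peak traffic will be higher, so any real sizing would multiply it by a peak factor that the table does not specify.
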
Support 1-on-1 (direct) messaging between two users with real-time delivery
Support group chats with up to 500 members; any member can send a message visible to all other members
Show online/offline presence status for each user in real-time
Persist all messages; users can view full chat history with pagination (scroll-back) even after reconnecting
Deliver messages in the correct chronological order within a conversation
Show message delivery indicators: Sent (single tick), Delivered (double tick), Read (blue tick)
Support multimedia messages: images, videos, files, and voice notes (upload → CDN → send link in message)
Send push notifications for messages received while the app is in the background or the user is offline
Support typing indicators ('User is typing...' shown to the other participant)
End-to-end encryption: messages are encrypted on the sender's device and decrypted only on the recipient's device; the server cannot read message content
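
Two of the requirements above, ordering within a conversation and the Sent/Delivered/Read indicators, can be sketched as a small data model. This is a minimal illustration, not a prescribed design: the `Message` fields, the per-conversation `seq` number, and the rule that receipts only move forward are all assumptions introduced here.

```python
from dataclasses import dataclass
from enum import IntEnum


class Receipt(IntEnum):
    SENT = 1       # single tick
    DELIVERED = 2  # double tick
    READ = 3       # blue tick


@dataclass
class Message:
    conversation_id: str
    seq: int        # hypothetical per-conversation sequence number assigned by the server
    sender_id: str
    body: str       # with end-to-end encryption this would be opaque ciphertext
    receipt: Receipt = Receipt.SENT

    def apply_ack(self, incoming: Receipt) -> None:
        """Advance the receipt state; late or duplicate acks never move it backwards."""
        self.receipt = max(self.receipt, incoming)


def in_order(messages: list[Message]) -> list[Message]:
    """Return messages of one conversation sorted by sequence number."""
    return sorted(messages, key=lambda m: m.seq)


if __name__ == "__main__":
    first = Message("conv-1", seq=1, sender_id="alice", body="hi")
    second = Message("conv-1", seq=2, sender_id="bob", body="hello")
    first.apply_ack(Receipt.READ)
    first.apply_ack(Receipt.DELIVERED)  # arrives late, ignored
    print([m.seq for m in in_order([second, first])], first.receipt.name)
```

Treating the receipt as a monotonically increasing value means a delayed "Delivered" acknowledgement cannot overwrite a "Read" one, which keeps the indicator stable when acknowledgements arrive out of order.
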
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements. These will guide your deep dives later.
Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
Frame NFRs for this specific system. 'Low latency search under 100ms' is far more valuable than just 'low latency'.
Add concrete numbers: 'P99 response time < 500ms', '99.9% availability', '10M DAU'. This drives architectural decisions.
Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?
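
Concrete targets such as '99.9% availability' or 'P99 response time < 500ms' become easier to reason about once converted into numbers you can check. The sketch below shows one way to do that; the helper names are hypothetical, and the nearest-rank percentile is just one common definition.

```python
import math


def downtime_budget_minutes(availability_pct: float, period_days: int = 365) -> float:
    """Allowed downtime in minutes for a given availability target over a period."""
    return (1 - availability_pct / 100) * period_days * 24 * 60


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def meets_p99_target(latencies_ms: list[float], target_ms: float = 500) -> bool:
    """True if the P99 latency of the samples is under the target."""
    return percentile_nearest_rank(latencies_ms, 99) < target_ms


if __name__ == "__main__":
    print(f"99.9% availability allows about {downtime_budget_minutes(99.9) / 60:.2f} hours of downtime per year")
    print(meets_p99_target([120, 180, 250, 90, 700]))
```

For example, 99.9% availability over a 365-day year leaves roughly 8.76 hours of total downtime, which makes the cost of each additional "nine" explicit when choosing an NFR.
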