WhatsApp Messaging System Design

Level: Advanced
Duration: 120 mins
Topic: WhatsApp Messaging

Requirements: 1:1 and Group Messaging

Designing Communication at Planetary Scale

WhatsApp processes over 100 billion messages daily, serving more than 2 billion active users across virtually every country on Earth. When you press 'send' on a message, an extraordinarily sophisticated system springs into action—routing your message through a global network of servers, ensuring delivery even when recipients are offline, maintaining end-to-end encryption that not even WhatsApp itself can break, and doing all of this in typically under one second.

Designing such a system represents one of the most challenging problems in distributed systems engineering. Unlike many applications where 'good enough' performance suffices, messaging systems must satisfy stringent requirements across multiple dimensions simultaneously: real-time latency for interactive conversations, perfect reliability where losing even one message is unacceptable, massive scalability to serve billions of users and hundreds of millions of concurrent connections, and uncompromising security to protect private communications.

This module will take you through the complete system design of a WhatsApp-like messaging platform. We begin where every good system design must: with a rigorous, exhaustive analysis of requirements.

What You Will Master

By completing this page, you will understand how to systematically decompose messaging system requirements into functional capabilities, non-functional constraints, and scale parameters. You'll learn the critical questions that separate naive implementations from production-ready architectures, and you'll develop the analytical framework for approaching any real-time communication system design.

Understanding the Problem Space

Before diving into requirements, we must understand what makes messaging systems fundamentally different from other distributed applications. A messaging system is not simply a database with a UI—it's a real-time communication fabric that must maintain the illusion of instantaneous, reliable delivery across billions of devices connected through unreliable networks.

The core challenges unique to messaging:

Fundamental Challenges of Messaging Systems

•Bidirectional Real-Time Communication — Unlike HTTP's request-response model, messaging requires servers to push data to clients instantly. Each user is simultaneously a sender and receiver, creating complex state synchronization problems.
•Unreliable Client Connectivity — Mobile devices constantly switch between WiFi and cellular, enter tunnels, lose battery, or get rebooted. The system must handle disconnections gracefully without losing messages.
•Perfect Delivery Guarantees — Users expect every message to arrive, in order, exactly once. This 'exactly-once delivery' semantic is notoriously difficult in distributed systems.
•Varying Message Types — Text, images, videos, voice notes, locations, contacts, and documents all have different storage, delivery, and display requirements.
•Group Dynamics — Group chats introduce O(n) fan-out problems where a single message must reach hundreds of recipients, each with different online states.
•Security and Privacy — End-to-end encryption must protect messages from everyone, including the service provider, while still enabling features like multi-device sync.

System Design Interview Context

When tackling this problem in an interview, spend the first 5-7 minutes deeply exploring these challenges with your interviewer. This demonstrates systems thinking and ensures you're solving the right problem. The questions you ask reveal far more about your expertise than the solutions you provide.

Core Functional Requirements

Functional requirements define what the system must do. For a messaging system, we must be exhaustive—missing a core requirement leads to fundamental architectural flaws. Let's systematically enumerate every capability our messaging platform must provide.

2.1 One-to-One (1:1) Messaging

Direct messaging between two users forms the foundation of any messaging platform. Let's decompose this seemingly simple feature:

1:1 Messaging Functional Requirements
Requirement	Description	Complexity
Send text message	User A can send a text message (up to ~65,000 characters) to User B	Low
Receive text message	User B receives the message in real-time if online, or upon reconnection if offline	Medium
Message ordering	Messages in a conversation appear in the order they were sent, even across network partitions	High
Send media	Users can send images (up to 16MB compressed), videos (up to 2GB), documents, voice notes (up to 15 min), and location data	High
Message acknowledgment	System provides sent, delivered, and read receipts with timestamps	Medium
Message history	Users can scroll back through unlimited message history, loading older messages on demand	Medium
Delete for me/everyone	Users can delete messages locally or retract them for all participants within a time window	Medium
Edit messages	Users can edit sent messages within a time window, with edit history visible	Medium
Reply to specific message	Users can quote and reply to specific messages, maintaining threaded context	Low
Forward messages	Users can forward messages to other chats, with forwarding metadata preserved	Low
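
The 'Message acknowledgment' row hides a subtlety: receipts can arrive late or out of order, so a message's status should only ever move forward. A minimal sketch of that rule follows; the `Status` and `apply_receipt` names are illustrative assumptions, not part of any real client.

```python
from enum import IntEnum


class Status(IntEnum):
    """Receipt states, ordered so that a larger value means further along."""
    SENT = 1        # server accepted the message (single checkmark)
    DELIVERED = 2   # recipient device acknowledged receipt
    READ = 3        # recipient opened the conversation


def apply_receipt(current: Status, incoming: Status) -> Status:
    """Return the new status, never moving backwards.

    Receipts may arrive out of order (e.g. READ before DELIVERED),
    so we keep whichever state is further along.
    """
    return max(current, incoming)


status = Status.SENT
for receipt in (Status.READ, Status.DELIVERED):  # out-of-order arrival
    status = apply_receipt(status, receipt)
print(status.name)  # READ
```

Treating the status as a monotonic value also makes receipt handling idempotent: replaying the same receipt changes nothing.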

2.2 Group Messaging

Group chats are substantially more complex than 1:1 messaging. What seems like 'just sending to multiple people' creates intricate consistency and delivery challenges:

Group Messaging Functional Requirements
Requirement	Description	Complexity
Create group	Any user can create a group with a name, icon, and description, inviting initial members	Low
Add members	Admins can add new members; system must sync full message history or from join point	High
Remove members	Admins can remove members; removed users lose access but retain prior messages locally	Medium
Leave group	Any member can leave; remaining members are notified	Low
Admin hierarchy	Groups support multiple admin levels with different permissions (add/remove members, change settings, etc.)	Medium
Group size limits	Support up to 1,024 members per group (WhatsApp's current limit), with graceful handling of fan-out	High
Message delivery to all	Every message must reach every group member, with appropriate retry logic for offline members	Very High
Consistent group membership view	All members must have consistent view of who is in the group, despite network partitions	Very High
Group settings	Admins can restrict who can send messages, change group info, or add members	Low
Disappearing messages	Groups can enable auto-delete for messages after configured time periods	Medium

The Group Fan-Out Problem

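A single message in a 1,000-member group requires 999 deliveries. In the worst case, if 100 million active groups were each at 1,000 members and every member posted 1 message/day, that would be roughly 100 trillion deliveries daily. Real groups are far smaller (about 8 members on average, per the scale estimates later on this page), but the multiplier on the largest groups is why group messaging dominates architecture decisions.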
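
The fan-out arithmetic can be checked with a few lines. This is a back-of-envelope sketch, assuming every group has the same size and every member posts the same number of messages; `daily_deliveries` is a hypothetical helper name.

```python
def daily_deliveries(groups: int, members: int, msgs_per_member: int = 1) -> int:
    """Deliveries per day if every member of every group posts msgs_per_member times."""
    messages = groups * members * msgs_per_member
    return messages * (members - 1)  # each message goes to everyone except its sender


worst_case = daily_deliveries(100_000_000, 1_000)  # every group at 1,000 members
typical = daily_deliveries(100_000_000, 8)         # average group size from the scale table

print(f"{worst_case:.2e}")  # 9.99e+13, i.e. ~100 trillion
print(f"{typical:.2e}")     # 5.60e+09
```

The gap of more than four orders of magnitude between the two cases suggests sizing the delivery path for the tail of very large groups rather than for the average group.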

User and Account Management

While not the 'core' messaging feature, user management requirements significantly impact architecture. These requirements determine authentication flows, data storage patterns, and privacy mechanisms.

User Management Requirements

•Phone number registration — Users register with phone number (not email), verified via SMS or voice call OTP. This establishes identity and enables contact discovery.
•Contact synchronization — The app synchronizes the user's phone contacts with the server to discover which contacts are on the platform, using privacy-preserving techniques.
•Profile management — Users can set profile photo, name, status message, and 'last seen' visibility preferences. Profiles sync across all devices.
•Privacy controls — Granular controls for who can see profile photo, status, and last seen (everyone, contacts, specific contacts, nobody).
•Block/unblock users — Blocked users cannot send messages, see online status, or see profile updates. Blocking is silent (blocked users aren't notified).
•Account deletion — Users can permanently delete their account, which must cascade through all conversations, group memberships, and stored media.
•Multi-device support — Users can be logged into multiple devices simultaneously (phone + tablet + web + desktop), with all devices receiving messages in sync.
•Device linking — Secondary devices are linked to the primary phone through secure key exchange, enabling E2E encryption across devices.

Contact Discovery: The Privacy Challenge

Contact discovery is a deceptively difficult problem. Users want to know which of their contacts are on the platform, but uploading your entire contact list to a server raises serious privacy concerns.

Approaches to contact discovery:

Contact Discovery Approaches
Approach 1: Plain Upload (Privacy-Poor)
────────────────────────────────────────
- Client uploads contacts: [+1234567890, +0987654321, ...]
- Server returns registered matches
- Problem: Server knows your entire social graph
 
Approach 2: Hash-Based Matching (Better, but flawed)
────────────────────────────────────────────────────
- Client hashes contacts: [SHA256(+1234567890), ...]
- Server matches against registered user hashes
- Problem: Phone numbers have low entropy (~10^10 possible numbers)
         Server can precompute all hashes and reverse them
 
Approach 3: Private Set Intersection (Privacy-Preserving)
────────────────────────────────────────────────────────
- Use cryptographic protocols where:
  • Client learns ONLY which contacts are registered
  • Server learns NOTHING about unregistered contacts
- Examples: PSI using Diffie-Hellman or homomorphic encryption
- Trade-off: Higher computational cost, but mathematically private
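
To make the weakness of Approach 2 concrete, the sketch below precomputes SHA-256 hashes for a small block of phone numbers and then reverses an "anonymous" hash by lookup. It is a toy at 10,000 candidates; the prefix and numbers are made up, and the function names are illustrative assumptions. The same idea scales to the full ~10^10 number space given enough compute.

```python
import hashlib


def sha256_hex(number: str) -> str:
    """Hash a phone number the way a naive client might before upload."""
    return hashlib.sha256(number.encode("utf-8")).hexdigest()


def build_reverse_table(prefix: str, digits: int) -> dict[str, str]:
    """Precompute hash -> number for every number under a prefix.

    'digits' trailing digits gives 10**digits candidates.
    """
    table = {}
    for n in range(10 ** digits):
        number = f"{prefix}{n:0{digits}d}"
        table[sha256_hex(number)] = number
    return table


# Hypothetical example: a made-up prefix with a 4-digit suffix.
table = build_reverse_table("+1555000", 4)
uploaded = sha256_hex("+15550001234")  # what the client would send
print(table[uploaded])                 # +15550001234: the server recovers the number
```

Salting does not help here, because the server must be able to compute the same hash in order to match it.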

Interview Insight

Mentioning Private Set Intersection for contact discovery in an interview demonstrates security awareness beyond the obvious. It shows you've thought about privacy implications that many engineers overlook.

Scale Requirements Analysis

Understanding scale is critical because it determines nearly every architectural decision. A system for 1,000 users looks nothing like one for 1 billion. Let's establish WhatsApp-scale numbers and derive the implications:

WhatsApp Scale Parameters (2024 Estimates)
Metric	Value	Design Implication
Monthly Active Users (MAU)	2 billion	Massive user database sharding required
Daily Active Users (DAU)	1.4 billion	~16,000 users coming online PER SECOND on average (1.4B / 86,400, assuming one connect per user per day)
Messages per day	100+ billion	~1.2 million messages PER SECOND sustained
Average messages per user per day	~70	Inbox storage grows quickly; need efficient storage
Media messages (% of total)	~25%	Massive blob storage requirements (petabytes)
Average group size	~8 members	Fan-out factor for group messages
Peak multiplier	~3x average	Must handle ~3.5 million msgs/sec at peaks
Geographic distribution	200+ countries	Latency-sensitive; need global presence
Average message size (text)	~100 bytes	100 billion × 100 bytes = 10 TB/day of text alone
Average media size	~500 KB	25 billion × 500 KB = 12.5 PB/day of media

4.1 Derived Throughput Requirements

From these base metrics, we can calculate the system's throughput requirements:

Throughput Calculations
MESSAGE THROUGHPUT
══════════════════
Daily messages:     100 billion
Seconds per day:    86,400
Average rate:       100B / 86,400 ≈ 1.16 million msgs/second
Peak rate (3x):     ~3.5 million msgs/second
 
CONNECTION THROUGHPUT
═════════════════════
DAU:                1.4 billion users
Average online time: ~4 hours/day (estimate)
Concurrent users:    1.4B × (4/24) ≈ 233 million concurrent
At any second:       ~233 million WebSocket connections
 
Allowing headroom above these estimates, plan capacity for:
- 300+ million simultaneous WebSocket connections
- ~3.5 million message deliveries/second
- ~290,000 media uploads/second on average (25B / 86,400), ~870,000 at 3x peak
 
STORAGE THROUGHPUT
══════════════════
Text messages:      10 TB/day → ~120 MB/second sustained write
Media files:        12.5 PB/day → ~150 GB/second sustained write
Total daily ingest: ~12.5 PB (massive object storage requirement)
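
The same back-of-envelope figures can be reproduced in a short script, which makes it easy to vary an assumption and see what changes. The inputs are the estimates from the scale table; decimal units (1 KB = 1,000 bytes) and the variable names are assumptions of this sketch.

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the scale table above (all estimates).
daily_messages = 100e9
dau = 1.4e9
media_fraction = 0.25
text_bytes = 100
media_bytes = 500e3      # 500 KB
peak_multiplier = 3
online_hours = 4         # average time online per user per day

avg_msgs_per_sec = daily_messages / SECONDS_PER_DAY
peak_msgs_per_sec = avg_msgs_per_sec * peak_multiplier
concurrent = dau * online_hours / 24
media_per_sec = daily_messages * media_fraction / SECONDS_PER_DAY
text_write_mb_s = daily_messages * text_bytes / SECONDS_PER_DAY / 1e6
media_write_gb_s = daily_messages * media_fraction * media_bytes / SECONDS_PER_DAY / 1e9

print(f"avg msgs/s:       {avg_msgs_per_sec:,.0f}")
print(f"peak msgs/s:      {peak_msgs_per_sec:,.0f}")
print(f"concurrent users: {concurrent:,.0f}")
print(f"media uploads/s:  {media_per_sec:,.0f}")
print(f"text write MB/s:  {text_write_mb_s:,.1f}")
print(f"media write GB/s: {media_write_gb_s:,.1f}")
```

Note that the media upload rate is derived directly from the 25% media share, so it moves with that assumption.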

The Numbers Tell the Story

These calculations reveal why WhatsApp famously ran on just 50 engineers for years—not because the problem is simple, but because they made brilliant architectural choices. Each architectural decision must be evaluated against these numbers. A solution that adds 1ms latency per message means 3,500 extra seconds of server time per second at peak load.

Latency Requirements

Messaging latency directly impacts user experience. Unlike web page loads where 1-2 seconds is acceptable, chat messages must feel instantaneous. Users subconsciously expect the same responsiveness as in-person conversation.

Latency Targets by Operation
Operation	Target P99 Latency	Rationale
Send message (optimistic UI)	< 50ms	Message should appear 'sent' immediately locally
Message delivery to online recipient	< 300ms	Feels real-time; enables natural conversation flow
Message delivery to recently offline user	< 2 seconds	Upon reconnection, catch-up should be fast
Group message fan-out (1000 members)	< 1 second	All online members should receive 'simultaneously'
Media upload start (to first byte acknowledged)	< 500ms	User knows upload has begun
Typing indicator propagation	< 200ms	Must feel real-time or it's useless
Read receipt delivery	< 500ms	Should appear before user looks away
Contact sync (initial)	< 10 seconds	First-time sync with 1000+ contacts
Message history load (page of 20)	< 300ms	Scrolling back should feel smooth
Search results	< 500ms	Searching through years of messages

Global Latency Considerations

With users in 200+ countries, network Round-Trip Time (RTT) varies dramatically:

•Same city: 5-20ms RTT
•Same continent: 30-80ms RTT
•Cross-Pacific: 150-200ms RTT
•User with poor connection: 300-1000+ms RTT

Architectural implication: To achieve <300ms message delivery globally, the system cannot depend on more than 1-2 round trips. This rules out architectures requiring:

•Synchronous database commits to a distant primary
•Multiple sequential API calls
•Complex consensus protocols for each message

Edge presence is mandatory. Users must connect to nearby data centers, and when sender and receiver are far apart, the path between their data centers must be as direct as possible.

Latency Budget Breakdown

For a 300ms end-to-end delivery target with the sender in Brazil and the recipient in Japan, roughly 150ms can go to long-haul network transit between the two regions alone (the last hop to each user's nearest edge server is comparatively short). That leaves only about 150ms for all processing: authentication, message queuing, routing, recipient lookup, and push delivery. Every millisecond counts.
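
A latency budget is easiest to reason about when written down as a table that must sum to less than the target. In the sketch below only the 300ms target and the ~150ms network figure come from the text; every per-stage processing number is an assumed placeholder to be replaced with measurements.

```python
TARGET_MS = 300

budget_ms = {
    "long-haul network transit": 150,     # figure used in the text
    "authentication": 10,                 # assumed
    "queueing and persistence": 40,       # assumed
    "recipient lookup and routing": 30,   # assumed
    "push to recipient connection": 50,   # assumed
}

used = sum(budget_ms.values())
headroom = TARGET_MS - used

for stage, ms in budget_ms.items():
    print(f"{stage:32s}{ms:4d} ms")
print(f"{'total':32s}{used:4d} ms (headroom {headroom} ms)")
```

With these placeholder numbers the budget leaves 20ms of slack, which shows how little room there is for an extra synchronous hop.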

Reliability Requirements

Reliability in messaging is non-negotiable. Users will forgive occasional slowness, but they will not forgive lost messages. A messaging system that loses even 0.001% of messages would lose 1 million messages per day at WhatsApp scale—utterly unacceptable.

Reliability Requirements Breakdown

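•Message durability: Once a user sees the 'single checkmark' (sent), the message MUST be delivered eventually. Treat 99.999% durability as a floor, not a goal: at 100 billion messages per day it would still allow about 1 million losses daily, so the practical target must be several orders of magnitude stricter.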
•Ordering guarantees: Messages in a conversation must be delivered in send order. Causal ordering must be preserved even during network partitions.
•Exactly-once delivery: Each message should be delivered to each recipient exactly once. Duplicates are annoying; missing messages are unforgivable.
•Device synchronization: All linked devices must eventually have the same message history. Conflicts must resolve deterministically.
•Failure transparency: When delivery fails, users must be clearly informed and given retry options. Silent failures are never acceptable.
•Idempotency: Server APIs must handle retries gracefully. Mobile networks cause duplicate requests; these must not cause duplicate messages.
•Crash recovery: If server crashes mid-operation, no messages should be lost or corrupted. Write-ahead logging and checkpoints are essential.
•Geographic resilience: Entire data centers can fail. Messages must survive regional outages with automatic failover.
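
One common way to meet the ordering requirement is to number messages per conversation and have the receiver hold back anything that arrives early. The sketch below assumes the server assigns a gap-free sequence number starting at 1 for each conversation; that scheme and the `ConversationInbox` class are illustrative assumptions rather than WhatsApp's actual mechanism.

```python
class ConversationInbox:
    """Releases messages in sequence order, buffering any that arrive early."""

    def __init__(self) -> None:
        self.next_seq = 1                  # next sequence number we can release
        self.pending: dict[int, str] = {}  # early arrivals, keyed by sequence number

    def receive(self, seq: int, body: str) -> list[str]:
        """Accept one message; return every message that is now deliverable, in order."""
        if seq < self.next_seq or seq in self.pending:
            return []                      # duplicate: already released or already buffered
        self.pending[seq] = body
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


inbox = ConversationInbox()
print(inbox.receive(2, "world"))  # []  (held until 1 arrives)
print(inbox.receive(1, "hello"))  # ['hello', 'world']
print(inbox.receive(1, "hello"))  # []  (duplicate ignored)
```

A production client would also need a timeout or gap-fill request so that one missing sequence number cannot stall the conversation indefinitely.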

The Exactly-Once Delivery Paradox

'Exactly-once delivery' is an impossible guarantee in distributed systems with arbitrary failures. What we actually implement is 'effectively-once' through idempotency:

The realistic guarantee:

•We may attempt delivery multiple times (at-least-once)
•Each delivery attempt includes a unique message ID
•Recipients deduplicate based on message ID
•The appearance to users is exactly-once, even if internally it's at-least-once + deduplication

This distinction matters architecturally. We must:

•Store message IDs for deduplication (with TTL to prevent infinite storage growth)
•Make delivery operations idempotent at every layer
•Handle the case where the sender crashes after the server has received the message but before the sender receives the acknowledgment
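
The deduplication step can be sketched as a small cache of message IDs with expiry. This is a minimal single-process illustration, assuming message IDs are unique strings chosen by the sender; the `DedupCache` class and its injectable clock are hypothetical, and a real deployment would more likely use a shared store with native expiry.

```python
import time


class DedupCache:
    """Remembers message IDs for ttl_seconds so redelivered copies can be dropped."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock                 # injectable so tests can control time
        self.seen: dict[str, float] = {}   # message_id -> expiry time

    def first_time(self, message_id: str) -> bool:
        """Return True if this ID has not been seen within the TTL window."""
        now = self.clock()
        # Drop expired entries so the table does not grow without bound.
        # (A full sweep per call is fine for a sketch, not for production.)
        self.seen = {m: exp for m, exp in self.seen.items() if exp > now}
        if message_id in self.seen:
            return False
        self.seen[message_id] = now + self.ttl
        return True


fake_now = [0.0]
cache = DedupCache(ttl_seconds=60, clock=lambda: fake_now[0])
delivered = [m for m in ["a1", "b2", "a1"] if cache.first_time(m)]
print(delivered)               # ['a1', 'b2']
fake_now[0] = 61.0
print(cache.first_time("a1"))  # True: the earlier entry has expired
```

The TTL has to exceed the longest period over which a retry can still arrive, otherwise a late duplicate slips through as a new message.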

The Two Generals Problem

The 'Two Generals Problem' shows that two parties communicating over an unreliable channel can never become certain they agree, no matter how many messages they exchange: the final acknowledgment can always be lost. Our engineering solution is to make this impossibility invisible to users through persistent storage, retries, and idempotent delivery. Understanding this fundamental limitation helps you explain why guaranteed delivery is theoretically impossible yet reliability is practically achievable.

Availability Requirements

Availability measures the percentage of time the system is operational. For a messaging platform that users depend on for daily communication—including emergency situations—availability requirements are exceptionally stringent.

Availability Targets and Their Implications
Availability Level	Downtime/Year	Assessment
99% (two nines)	3.65 days	Unacceptable for messaging
99.9% (three nines)	8.76 hours	Still too much for critical communication
99.99% (four nines)	52.6 minutes	Minimum acceptable; typical industry target
99.999% (five nines)	5.26 minutes	Aspirational target for core messaging
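
The downtime column follows directly from the availability percentage. A small sketch, assuming a 365-day year; `downtime_minutes_per_year` is an illustrative helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of allowed downtime per year at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.2f} min/year ({minutes / 60:.2f} hours)")
```

Each additional nine cuts the allowance by a factor of ten, which is why moving from four to five nines usually requires automated failover rather than human response.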

Graceful Degradation Hierarchy

Achieving 99.99%+ availability requires a carefully designed degradation strategy. Not all features are equally critical:

Tier 1 (Must never fail):

•1:1 text message sending/receiving
•Message storage and retrieval
•Basic authentication

Tier 2 (Brief outages acceptable):

•Group messaging
•Media upload/download
•Read receipts and typing indicators

Tier 3 (Can degrade significantly):

•Contact synchronization
•Profile updates
•New user registration
•Search across message history

During overload or partial failures, the system should shed Tier 3 functionality first, then Tier 2, protecting Tier 1 at all costs. This requires request classification, priority queuing, and circuit breakers at every service boundary.
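
The shedding order can be expressed as a simple admission check. This is a sketch of the idea only: the load thresholds (80% and 95%) and the `admit` function are assumptions chosen for illustration, and a real system would combine this with priority queues and circuit breakers.

```python
from enum import IntEnum


class Tier(IntEnum):
    CORE = 1        # 1:1 text, message storage, basic authentication
    IMPORTANT = 2   # groups, media, receipts, typing indicators
    DEFERRABLE = 3  # contact sync, profile updates, registration, search


# Hypothetical thresholds: the load fraction above which a tier is shed.
SHED_ABOVE = {Tier.DEFERRABLE: 0.80, Tier.IMPORTANT: 0.95}


def admit(tier: Tier, load: float) -> bool:
    """Return True if a request of this tier should be served at the given load (0.0-1.0)."""
    threshold = SHED_ABOVE.get(tier)
    # Tier 1 has no threshold: it is never shed by this check.
    return threshold is None or load <= threshold


for load in (0.50, 0.85, 0.97):
    served = [t.name for t in Tier if admit(t, load)]
    print(f"load {load:.2f}: {served}")
```

Because Tier 1 has no entry in the threshold table, it is protected by construction rather than by a carefully tuned number.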

The CAP Theorem Trade-off

When network partitions occur, we must choose between Consistency and Availability. For messaging, we typically favor Availability: it's better for users to send messages that might temporarily appear out-of-order than to completely block messaging. We accept eventual consistency for ordering and use conflict resolution strategies rather than blocking writes.

Security Requirements

Private messaging carries intimate conversations, financial information, personal photos, and sensitive business communications. Security is not a feature—it's a foundational requirement that shapes every architectural layer.

Security Requirements

•End-to-End Encryption (E2EE): Messages must be encrypted such that only sender and recipient(s) can read them. The service provider must not have access to message content, even if compelled by law enforcement.
•Key verification: Users must be able to verify encryption keys through security codes or QR code scanning to detect man-in-the-middle attacks.
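•Forward secrecy: Compromise of current keys must not allow decryption of past messages. Each message is encrypted under a short-lived message key produced by a ratcheting key chain and discarded after use; long-term identity keys are used for the initial key agreement, not to encrypt messages directly.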
•Device authentication: Only authorized devices should receive messages. New device linking requires verification from primary device.
•Transport security: All client-server communication must use TLS 1.3 with certificate pinning to prevent interception.
•Server-side security: Even metadata (who talks to whom, when) should be minimized. Consider techniques like sealed sender for metadata protection.
•Spam and abuse prevention: Despite E2EE, system must prevent spam, harassment, and illegal content distribution. This creates tension with privacy—balance through client-side reporting.
•Account takeover prevention: Protect against SIM swapping, phishing, and social engineering attacks on accounts.

The Signal Protocol: Foundation of Modern Secure Messaging

WhatsApp, Facebook Messenger (secret conversations), Google Messages, and others all use the Signal Protocol (or derivatives) for E2EE. Understanding its key concepts is essential:

Core components:

•Extended Triple Diffie-Hellman (X3DH): Establishes the initial shared secret between two parties, even if one is offline
•Double Ratchet Algorithm: Evolves encryption keys after each message, providing forward secrecy and break-in recovery
•Prekeys: Allow initiating encrypted communication with offline users through server-stored public key bundles
•Session state: Each conversation has ratchet state that must be stored durably on the device and kept consistent as messages flow; multi-device support has to account for this state on every linked device

The protocol's elegance is that it provides:

•Confidentiality (only parties can read messages)
•Forward secrecy (past messages safe even if keys leak)
•Break-in recovery (future messages safe even if keys leak)
•Deniability (no cryptographic proof who sent a message)
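
Forward secrecy is easier to see in code. The sketch below shows only the symmetric half of a ratchet: a one-way key chain built from HMAC-SHA256, with one output feeding the next step and the other used as a per-message key. It is a simplified illustration, not the Signal Protocol itself. It omits the Diffie-Hellman ratchet that provides break-in recovery, and the starting secret is a placeholder standing in for the output of a key agreement.

```python
import hashlib
import hmac


def kdf_step(chain_key: bytes) -> tuple[bytes, bytes]:
    """Advance the chain one step: return (next_chain_key, message_key).

    Distinct HMAC inputs keep the two outputs independent. HMAC is one-way,
    so holding the next chain key does not reveal earlier message keys.
    """
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


# Placeholder secret; in a real protocol this comes from the key agreement.
chain_key = hashlib.sha256(b"placeholder shared secret").digest()

message_keys = []
for _ in range(3):
    chain_key, mk = kdf_step(chain_key)  # the old chain key is overwritten
    message_keys.append(mk)

for i, mk in enumerate(message_keys, start=1):
    print(i, mk.hex()[:16])
```

Once a message key has been used and the old chain key overwritten, an attacker who later steals the device finds only the current chain key, from which the earlier keys cannot be recomputed.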

E2EE Complicates Everything

E2EE creates significant architectural complexity: you cannot index message content server-side, cannot implement server-side search, cannot filter content automatically, and cannot recover messages if users lose their keys. Every feature must be reimagined through the lens of 'the server cannot see message content.'

Requirements Prioritization: The MVP

With exhaustive requirements defined, we must prioritize. In a system design interview, you cannot design everything—and even in production, you cannot build everything at once. Let's establish the Minimum Viable Product (MVP) for a WhatsApp-like system:

MVP Scope (Must Have)

•User registration with phone verification
•1:1 text messaging with delivery confirmation
•Real-time message push to online users
•Offline message storage and delivery on reconnect
•Basic message ordering within conversations
•Group creation and messaging (up to 256 members)
•User presence (online/last seen)
•Read receipts (optional per user)
•End-to-end encryption for all messages
•Basic contact discovery

Phase 2 (Important, Deferrable)

•Image and video sharing
•Voice notes
•Large group support (1000+ members)
•Multi-device support
•Message editing and deletion
•Message search
•Disappearing messages
•Status/Stories feature
•Voice and video calling
•Business/API integrations

Interview Prioritization Strategy

For a 45-minute system design interview, focus on:

•Minutes 1-5: Clarify requirements, establish MVP scope
•Minutes 5-10: Estimate scale, derive system parameters
•Minutes 10-30: Design core architecture (1:1 messaging, delivery, storage)
•Minutes 30-40: Deep dive on one complex area (likely group messaging or E2EE)
•Minutes 40-45: Touch on additional concerns (monitoring, security, scaling)

Explicitly state what you're NOT designing in detail. A clear scope demonstrates senior-level judgment.

The Power of Saying 'No'

In interviews, confidently deferring complex features shows maturity: 'Voice/video calling is a substantial separate system—I'll focus on text messaging today and can discuss calling architecture at a high level if time permits.' Interviewers respect this more than a superficial treatment of everything.

Summary and Key Takeaways

Requirements analysis for a messaging system is far more complex than it first appears. We've established the foundation for all subsequent architectural decisions.

Key Takeaways

•Messaging is fundamentally different from request-response systems—it requires real-time bidirectional communication, perfect reliability, and graceful offline handling.
•Scale drives architecture — 100 billion daily messages, 2 billion users, and 300 million concurrent connections demand extreme optimization at every layer.
•Group messaging introduces O(n) complexity — A single message to a 1000-member group requires 999 deliveries, fundamentally changing delivery architecture.
•Latency requirements are stringent — <300ms delivery for real-time feel, with the first 150ms often consumed by network RTT alone.
•Reliability must be effectively 100% — Losing even 0.001% of messages means 1 million lost daily. 'Exactly-once' is achieved through idempotency.
•E2EE is non-negotiable — The Signal Protocol provides confidentiality, forward secrecy, and deniability, but complicates nearly every feature.
•MVP scoping is essential — In interviews and reality, explicitly define what you will and won't design to demonstrate senior judgment.

What's next:

With requirements firmly established, we'll explore message delivery guarantees in depth—how to ensure every message reaches its destination exactly once, even across unreliable networks, device failures, and server crashes. We'll examine acknowledgment flows, retry strategies, and the storage patterns that make reliable delivery possible.

Page Complete

You now have a comprehensive framework for analyzing messaging system requirements. This systematic approach—decomposing functional needs, quantifying scale, and establishing non-functional constraints—applies to any system design problem you'll encounter.

1 / 6

Loading learning content...

System Design (HLD)WhatsApp Messaging

WhatsApp Messaging System Design

LevelAdvanced

Duration120 mins

TopicWhatsApp Messaging

1 / 6

Requirements: 1:1 and Group Messaging

Designing Communication at Planetary Scale

This module will take you through the complete system design of a WhatsApp-like messaging platform. We begin where every good system design must: with a rigorous, exhaustive analysis of requirements.

What You Will Master

Understanding the Problem Space

The core challenges unique to messaging:

Fundamental Challenges of Messaging Systems

•Bidirectional Real-Time Communication — Unlike HTTP's request-response model, messaging requires servers to push data to clients instantly. Each user is simultaneously a sender and receiver, creating complex state synchronization problems.
•Unreliable Client Connectivity — Mobile devices constantly switch between WiFi and cellular, enter tunnels, lose battery, or get rebooted. The system must handle disconnections gracefully without losing messages.
•Perfect Delivery Guarantees — Users expect every message to arrive, in order, exactly once. This 'exactly-once delivery' semantic is notoriously difficult in distributed systems.
•Varying Message Types — Text, images, videos, voice notes, locations, contacts, and documents all have different storage, delivery, and display requirements.
•Group Dynamics — Group chats introduce O(n) fan-out problems where a single message must reach hundreds of recipients, each with different online states.
•Security and Privacy — End-to-end encryption must protect messages from everyone, including the service provider, while still enabling features like multi-device sync.

System Design Interview Context

Core Functional Requirements

2.1 One-to-One (1:1) Messaging

Direct messaging between two users forms the foundation of any messaging platform. Let's decompose this seemingly simple feature:

1:1 Messaging Functional Requirements
Requirement	Description	Complexity
Send text message	User A can send a text message (up to ~65,000 characters) to User B	Low
Receive text message	User B receives the message in real-time if online, or upon reconnection if offline	Medium
Message ordering	Messages in a conversation appear in the order they were sent, even across network partitions	High
Send media	Users can send images (up to 16MB compressed), videos (up to 2GB), documents, voice notes (up to 15 min), and location data	High
Message acknowledgment	System provides sent, delivered, and read receipts with timestamps	Medium
Message history	Users can scroll back through unlimited message history, loading older messages on demand	Medium
Delete for me/everyone	Users can delete messages locally or retract them for all participants within a time window	Medium
Edit messages	Users can edit sent messages within a time window, with edit history visible	Medium
Reply to specific message	Users can quote and reply to specific messages, maintaining threaded context	Low
Forward messages	Users can forward messages to other chats, with forwarding metadata preserved	Low

2.2 Group Messaging

Group chats introduce exponentially more complexity than 1:1 messaging. What seems like 'just sending to multiple people' creates intricate consistency and delivery challenges:

Group Messaging Functional Requirements
Requirement	Description	Complexity
Create group	Any user can create a group with a name, icon, and description, inviting initial members	Low
Add members	Admins can add new members; system must sync full message history or from join point	High
Remove members	Admins can remove members; removed users lose access but retain prior messages locally	Medium
Leave group	Any member can leave; remaining members are notified	Low
Admin hierarchy	Groups support multiple admin levels with different permissions (add/remove members, change settings, etc.)	Medium
Group size limits	Support up to 1,024 members per group (WhatsApp's current limit), with graceful handling of fan-out	High
Message delivery to all	Every message must reach every group member, with appropriate retry logic for offline members	Very High
Consistent group membership view	All members must have consistent view of who is in the group, despite network partitions	Very High
Group settings	Admins can restrict who can send messages, change group info, or add members	Low
Disappearing messages	Groups can enable auto-delete for messages after configured time periods	Medium

The Group Fan-Out Problem

User and Account Management

User Management Requirements

•Phone number registration — Users register with phone number (not email), verified via SMS or voice call OTP. This establishes identity and enables contact discovery.
•Contact synchronization — The app synchronizes the user's phone contacts with the server to discover which contacts are on the platform, using privacy-preserving techniques.
•Profile management — Users can set profile photo, name, status message, and 'last seen' visibility preferences. Profiles sync across all devices.
•Privacy controls — Granular controls for who can see profile photo, status, and last seen (everyone, contacts, specific contacts, nobody).
•Block/unblock users — Blocked users cannot send messages, see online status, or see profile updates. Blocking is silent (blocked users aren't notified).
•Account deletion — Users can permanently delete their account, which must cascade through all conversations, group memberships, and stored media.
•Multi-device support — Users can be logged into multiple devices simultaneously (phone + tablet + web + desktop), with all devices receiving messages in sync.
•Device linking — Secondary devices are linked to the primary phone through secure key exchange, enabling E2E encryption across devices.

Contact Discovery: The Privacy Challenge

Approaches to contact discovery:

Contact Discovery Approaches
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Approach 1: Plain Upload (Privacy-Poor)
────────────────────────────────────────
- Client uploads contacts: [+1234567890, +0987654321, ...]
- Server returns registered matches
- Problem: Server knows your entire social graph
 
Approach 2: Hash-Based Matching (Better, but flawed)
────────────────────────────────────────────────────
- Client hashes contacts: [SHA256(+1234567890), ...]
- Server matches against registered user hashes
- Problem: Phone numbers have low entropy (~10^10 possible numbers)
         Server can precompute all hashes and reverse them
 
Approach 3: Private Set Intersection (Privacy-Preserving)
────────────────────────────────────────────────────────
- Use cryptographic protocols where:
  • Client learns ONLY which contacts are registered
  • Server learns NOTHING about unregistered contacts
- Examples: PSI using Diffie-Hellman or homomorphic encryption
- Trade-off: Higher computational cost, but mathematically private
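
To make Approach 3 concrete, here is a toy sketch of Diffie-Hellman-style private set intersection. It is an illustration under stated assumptions, not WhatsApp's actual mechanism: the function names (`hash_to_group`, `blind`, `discover_contacts`) are hypothetical, both parties run in one process for brevity, and the 127-bit modulus is far too small for real use. The idea is that exponentiation commutes, so a number blinded by both secrets matches regardless of who blinded first.

```python
import hashlib
import secrets

# Toy parameter: 2**127 - 1 is a Mersenne prime. NOT secure; real systems
# use standardized elliptic-curve groups with vetted parameters.
P = 2**127 - 1

def hash_to_group(phone: str) -> int:
    """Map a phone number to a group element (squared to land in the QR subgroup)."""
    digest = int.from_bytes(hashlib.sha256(phone.encode()).digest(), "big")
    return pow(digest % P, 2, P)

def new_secret() -> int:
    """Random private exponent in [2, P - 2]."""
    return secrets.randbelow(P - 3) + 2

def blind(values, secret):
    """Raise every element to a private exponent."""
    return [pow(v, secret, P) for v in values]

def discover_contacts(client_contacts, server_registered):
    a = new_secret()  # known only to the client
    b = new_secret()  # known only to the server

    # Client -> server: contacts blinded with a (server cannot unblind them).
    client_blinded = blind([hash_to_group(c) for c in client_contacts], a)

    # Server -> client: the client's values blinded again with b (same order),
    # plus the registered set blinded with b only.
    client_double = blind(client_blinded, b)
    server_blinded = blind([hash_to_group(r) for r in server_registered], b)

    # Client: apply a to the server's set; H(x)^(ab) == H(x)^(ba) on matches.
    server_double = set(blind(server_blinded, a))
    return [c for c, d in zip(client_contacts, client_double) if d in server_double]

matches = discover_contacts(
    ["+1234567890", "+0987654321", "+15550001111"],
    ["+1234567890", "+15550001111", "+19998887777"],
)
print(matches)
```

Note that the server still performs one exponentiation per registered user per query in this naive form, which is the computational cost the trade-off line refers to.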

Interview Insight

Scale Requirements Analysis

WhatsApp Scale Parameters (2024 Estimates)
Metric	Value	Design Implication
Monthly Active Users (MAU)	2 billion	Massive user database sharding required
Daily Active Users (DAU)	1.4 billion	~16,000 users coming online every second on average (1.4B / 86,400 s)
Messages per day	100+ billion	~1.2 million messages PER SECOND sustained
Average messages per user per day	~70	Inbox storage grows quickly; need efficient storage
Media messages (% of total)	~25%	Massive blob storage requirements (petabytes)
Average group size	~8 members	Fan-out factor for group messages
Peak multiplier	~3x average	Must handle ~3.5 million msgs/sec at peaks
Geographic distribution	200+ countries	Latency-sensitive; need global presence
Average message size (text)	~100 bytes	100 billion × 100 bytes = 10 TB/day of text alone
Average media size	~500 KB	25 billion × 500 KB = 12.5 PB/day of media

4.1 Derived Throughput Requirements

From these base metrics, we can calculate the system's throughput requirements:

Throughput Calculations
MESSAGE THROUGHPUT
══════════════════
Daily messages:     100 billion
Seconds per day:    86,400
Average rate:       100B / 86,400 ≈ 1.16 million msgs/second
Peak rate (3x):     ~3.5 million msgs/second
 
CONNECTION THROUGHPUT
═════════════════════
DAU:                1.4 billion users
Average online time: ~4 hours/day (estimate)
Concurrent users:    1.4B × (4/24) ≈ 233 million concurrent
At any second:       ~233 million WebSocket connections
 
With headroom for peak load, plan capacity for:
- 300+ million simultaneous WebSocket connections
- ~3.5 million message deliveries/second
- ~500,000 media uploads/second (25 billion/day averages ~290,000/second)
 
STORAGE THROUGHPUT
══════════════════
Text messages:      10 TB/day → ~120 MB/second sustained write
Media files:        12.5 PB/day → ~150 GB/second sustained write
Total daily ingest: ~12.5 PB (massive object storage requirement)
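
The arithmetic above is easy to re-derive and adjust when the inputs change. The sketch below recomputes each figure from the table's base metrics; the variable names are illustrative, and units are decimal (1 KB = 1,000 bytes), which is an assumption the text does not state.

```python
SECONDS_PER_DAY = 86_400

def per_second(daily_total: float) -> float:
    """Convert a per-day total into an average per-second rate."""
    return daily_total / SECONDS_PER_DAY

# Base metrics taken from the scale table.
daily_messages = 100e9
peak_multiplier = 3
dau = 1.4e9
avg_online_hours = 4      # the text's own estimate
media_share = 0.25
text_bytes = 100
media_bytes = 500e3       # 500 KB, decimal units (assumption)

avg_msg_rate = per_second(daily_messages)
peak_msg_rate = avg_msg_rate * peak_multiplier
concurrent_users = dau * avg_online_hours / 24
text_write_rate = per_second(daily_messages * text_bytes)      # bytes/second
media_per_day = daily_messages * media_share
media_write_rate = per_second(media_per_day * media_bytes)     # bytes/second
avg_media_uploads = per_second(media_per_day)

print(f"avg messages/s:    {avg_msg_rate:,.0f}")
print(f"peak messages/s:   {peak_msg_rate:,.0f}")
print(f"concurrent users:  {concurrent_users:,.0f}")
print(f"text write MB/s:   {text_write_rate / 1e6:,.1f}")
print(f"media write GB/s:  {media_write_rate / 1e9:,.1f}")
print(f"avg media uploads/s: {avg_media_uploads:,.0f}")
```

The exact results (about 116 MB/second for text and about 145 GB/second for media) are slightly below the rounded figures quoted above, which is fine for capacity planning at this level of precision.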

The Numbers Tell the Story

Latency Requirements

Latency Targets by Operation
Operation	Target P99 Latency	Rationale
Send message (optimistic UI)	< 50ms	Message should appear 'sent' immediately locally
Message delivery to online recipient	< 300ms	Feels real-time; enables natural conversation flow
Message delivery to recently offline user	< 2 seconds	Upon reconnection, catch-up should be fast
Group message fan-out (1000 members)	< 1 second	All online members should receive 'simultaneously'
Media upload start (to first byte acknowledged)	< 500ms	User knows upload has begun
Typing indicator propagation	< 200ms	Must feel real-time or it's useless
Read receipt delivery	< 500ms	Should appear before user looks away
Contact sync (initial)	< 10 seconds	First-time sync with 1000+ contacts
Message history load (page of 20)	< 300ms	Scrolling back should feel smooth
Search results	< 500ms	Searching through years of messages

Global Latency Considerations

With users in 200+ countries, network Round-Trip Time (RTT) varies dramatically:

Same city: 5-20ms RTT
Same continent: 30-80ms RTT
Cross-Pacific: 150-200ms RTT
User with poor connection: 300-1000+ms RTT

Architectural implication: To achieve <300ms message delivery globally, the system cannot depend on more than 1-2 round trips. This rules out architectures requiring:

Synchronous database commits to a distant primary
Multiple sequential API calls
Complex consensus protocols for each message

Edge presence is mandatory. Users must connect to nearby data centers, and when sender and receiver are far apart, message routing must minimize the number of long-haul hops between regions.
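
The "1-2 round trips" limit follows directly from dividing the latency budget by the RTT. A minimal sketch, assuming round trips are strictly sequential and using a hypothetical `processing_ms` parameter for server-side work (zero by default):

```python
def max_round_trips(budget_ms: int, rtt_ms: int, processing_ms: int = 0) -> int:
    """Number of sequential network round trips that fit in a latency budget."""
    usable_ms = budget_ms - processing_ms
    return max(usable_ms // rtt_ms, 0)

# Upper-bound RTTs from the list above, plus both ends of the cross-Pacific range.
RTT_MS = {
    "same city": 20,
    "same continent": 80,
    "cross-Pacific (best case)": 150,
    "cross-Pacific (worst case)": 200,
}

for label, rtt in RTT_MS.items():
    print(f"{label}: {max_round_trips(300, rtt)} round trips within 300 ms")
```

For a cross-Pacific path the budget allows only one or two round trips, and any server processing time shrinks it further.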

Latency Budget Breakdown

Reliability Requirements

Reliability Requirements Breakdown

•Message durability: Once a user sees 'single checkmark' (sent), the message MUST be delivered eventually. 99.999% durability minimum.
•Ordering guarantees: Messages in a conversation must be delivered in send order. Causal ordering must be preserved even during network partitions.
•Exactly-once delivery: Each message should be delivered to each recipient exactly once. Duplicates are annoying; missing messages are unforgivable.
•Device synchronization: All linked devices must eventually have the same message history. Conflicts must resolve deterministically.
•Failure transparency: When delivery fails, users must be clearly informed and given retry options. Silent failures are never acceptable.
•Idempotency: Server APIs must handle retries gracefully. Mobile networks cause duplicate requests; these must not cause duplicate messages.
•Crash recovery: If server crashes mid-operation, no messages should be lost or corrupted. Write-ahead logging and checkpoints are essential.
•Geographic resilience: Entire data centers can fail. Messages must survive regional outages with automatic failover.
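
One common way to meet the ordering requirement is to stamp each message with a per-conversation sequence number and have the receiver hold back anything that arrives early. The sketch below assumes that scheme; the class name `InOrderBuffer` and the choice to start numbering at 1 are illustrative, not something the requirements prescribe.

```python
from typing import Dict, List

class InOrderBuffer:
    """Releases messages of one conversation strictly in sequence order."""

    def __init__(self) -> None:
        self._next_seq = 1                  # next sequence number to release
        self._pending: Dict[int, str] = {}  # early arrivals, keyed by sequence

    def receive(self, seq: int, payload: str) -> List[str]:
        """Accept one message; return the payloads that are now deliverable."""
        if seq < self._next_seq or seq in self._pending:
            return []                       # duplicate: already released or buffered
        self._pending[seq] = payload
        released = []
        # Drain the buffer while the next expected message is available.
        while self._next_seq in self._pending:
            released.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return released

buffer = InOrderBuffer()
print(buffer.receive(2, "second"))  # held back until message 1 arrives
print(buffer.receive(1, "first"))   # releases both, in order
```

A production version would also need a gap timeout, so that one lost message does not stall the conversation indefinitely.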

The Exactly-Once Delivery Paradox

'Exactly-once delivery' is an impossible guarantee in distributed systems with arbitrary failures. What we actually implement is 'effectively-once' through idempotency:

The realistic guarantee:

We may attempt delivery multiple times (at-least-once)
Each delivery attempt includes a unique message ID
Recipients deduplicate based on message ID
The appearance to users is exactly-once, even if internally it's at-least-once + deduplication

This distinction matters architecturally. We must:

Store message IDs for deduplication (with TTL to prevent infinite storage growth)
Make delivery operations idempotent at every layer
Handle the case where the sender crashes after server receives message but before receiving acknowledgment
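
The at-least-once plus deduplication pattern can be sketched as a small in-memory structure. This is a minimal illustration, assuming a single process and an injectable clock for testing; the class name `MessageDeduplicator` is hypothetical, and a real deployment would more likely keep the IDs in a shared store with native TTL support.

```python
import time
from typing import Callable, Dict

class MessageDeduplicator:
    """Remembers message IDs for a limited time so retries are not re-delivered."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}   # message_id -> expiry timestamp

    def _evict_expired(self, now: float) -> None:
        # Linear scan keeps the sketch short; it bounds memory but is O(n) per call.
        expired = [mid for mid, expiry in self._seen.items() if expiry <= now]
        for mid in expired:
            del self._seen[mid]

    def accept(self, message_id: str) -> bool:
        """Return True the first time an ID is seen within the TTL window."""
        now = self._clock()
        self._evict_expired(now)
        if message_id in self._seen:
            return False                     # duplicate delivery attempt: drop it
        self._seen[message_id] = now + self._ttl
        return True

dedup = MessageDeduplicator(ttl_seconds=3600)
for attempt in ["msg-1", "msg-1", "msg-2"]:
    print(attempt, "delivered" if dedup.accept(attempt) else "duplicate ignored")
```

The TTL must be longer than the longest retry window: a retry that arrives after its ID has expired would be delivered a second time.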

The Two Generals Problem

Availability Requirements

Availability Targets and Their Implications
Availability Level	Downtime/Year	Assessment
99% (two nines)	3.65 days	Unacceptable for messaging
99.9% (three nines)	8.76 hours	Still too much for critical communication
99.99% (four nines)	52.6 minutes	Minimum acceptable; typical industry target
99.999% (five nines)	5.26 minutes	Aspirational target for core messaging
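
The downtime column is just the unavailable fraction multiplied by the length of a year. A short sketch, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR

for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:,.2f} minutes/year ({minutes / 60:,.2f} hours)")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why moving from four nines to five usually demands a different operational model rather than incremental tuning.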

Graceful Degradation Hierarchy

Achieving 99.99%+ availability requires a carefully designed degradation strategy. Not all features are equally critical:

Tier 1 (Must never fail):

1:1 text message sending/receiving
Message storage and retrieval
Basic authentication

Tier 2 (Brief outages acceptable):

Group messaging
Media upload/download
Read receipts and typing indicators

Tier 3 (Can degrade significantly):

Contact synchronization
Profile updates
New user registration
Search across message history
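
One way to make the hierarchy operational is to tag every feature with its tier and let a single setting decide the lowest-priority tier currently being served. The sketch below assumes that approach; the feature identifiers and the `enabled_features` function are hypothetical names chosen for illustration.

```python
from enum import IntEnum
from typing import Dict, Set

class Tier(IntEnum):
    MUST_NEVER_FAIL = 1
    BRIEF_OUTAGES_OK = 2
    CAN_DEGRADE = 3

# Feature-to-tier mapping taken from the hierarchy above.
FEATURE_TIERS: Dict[str, Tier] = {
    "one_to_one_text": Tier.MUST_NEVER_FAIL,
    "message_storage": Tier.MUST_NEVER_FAIL,
    "basic_auth": Tier.MUST_NEVER_FAIL,
    "group_messaging": Tier.BRIEF_OUTAGES_OK,
    "media_transfer": Tier.BRIEF_OUTAGES_OK,
    "receipts_and_typing": Tier.BRIEF_OUTAGES_OK,
    "contact_sync": Tier.CAN_DEGRADE,
    "profile_updates": Tier.CAN_DEGRADE,
    "registration": Tier.CAN_DEGRADE,
    "history_search": Tier.CAN_DEGRADE,
}

def enabled_features(highest_tier_served: Tier) -> Set[str]:
    """Features that stay on when only tiers up to the given one are served."""
    return {name for name, tier in FEATURE_TIERS.items()
            if tier <= highest_tier_served}

# Under severe load, shed everything except Tier 1.
print(sorted(enabled_features(Tier.MUST_NEVER_FAIL)))
```

Keeping the mapping in one place makes the shedding order explicit and reviewable, instead of scattering ad hoc feature flags across services.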

The CAP Theorem Trade-off

Security Requirements

•End-to-End Encryption (E2EE): Messages must be encrypted such that only sender and recipient(s) can read them. The service provider must not have access to message content, even if compelled by law enforcement.
•Key verification: Users must be able to verify encryption keys through security codes or QR code scanning to detect man-in-the-middle attacks.
•Forward secrecy: Compromise of current keys must not allow decryption of past messages. Each message is encrypted with a fresh key produced by a continuously ratcheting key chain, and old keys are deleted after use.
•Device authentication: Only authorized devices should receive messages. New device linking requires verification from primary device.
•Transport security: All client-server communication must use TLS 1.3 with certificate pinning to prevent interception.
•Server-side security: Even metadata (who talks to whom, when) should be minimized. Consider techniques like sealed sender for metadata protection.
•Spam and abuse prevention: Despite E2EE, system must prevent spam, harassment, and illegal content distribution. This creates tension with privacy—balance through client-side reporting.
•Account takeover prevention: Protect against SIM swapping, phishing, and social engineering attacks on accounts.

The Signal Protocol: Foundation of Modern Secure Messaging

WhatsApp, Facebook Messenger (secret conversations), Google Messages, and others all use the Signal Protocol (or derivatives) for E2EE. Understanding its key concepts is essential:

Core components:

Extended Triple Diffie-Hellman (X3DH): Establishes initial shared secret between two parties, even if one is offline
Double Ratchet Algorithm: Evolves encryption keys after each message, providing forward secrecy and break-in recovery
Prekeys: Allow initiating encrypted communication with offline users through server-stored public key bundles
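Session state: Each pairwise session has ratchet state that must be persisted and kept consistent between its two endpoints; with multiple devices, each linked device needs its own session state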

The protocol's elegance is that it provides:

Confidentiality (only parties can read messages)
Forward secrecy (past messages safe even if keys leak)
Break-in recovery (future messages safe even if keys leak)
Deniability (no cryptographic proof who sent a message)
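
Forward secrecy is easiest to see in the symmetric-key half of the Double Ratchet: each step derives a message key and a replacement chain key through a one-way function, then discards the old chain key. The sketch below shows only that KDF chain, using HMAC-SHA256 with single-byte constants as a common convention; it omits the Diffie-Hellman ratchet, X3DH, and actual encryption, and the function names are illustrative rather than taken from any library.

```python
import hashlib
import hmac
from typing import List, Tuple

def ratchet_step(chain_key: bytes) -> Tuple[bytes, bytes]:
    """Advance the chain once: return (next_chain_key, message_key)."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key

def derive_message_keys(initial_chain_key: bytes, count: int) -> List[bytes]:
    """Derive `count` successive message keys, dropping each old chain key."""
    keys = []
    chain_key = initial_chain_key
    for _ in range(count):
        chain_key, message_key = ratchet_step(chain_key)  # old chain key is overwritten
        keys.append(message_key)
    return keys

# Both parties start from the same shared secret (established by X3DH in the
# real protocol; a fixed placeholder value here).
shared_secret = hashlib.sha256(b"placeholder shared secret").digest()
sender_keys = derive_message_keys(shared_secret, 3)
receiver_keys = derive_message_keys(shared_secret, 3)
print(sender_keys == receiver_keys)
```

Because HMAC is one-way, an attacker who steals the current chain key cannot run the chain backwards to recover earlier message keys. Recovery after a compromise comes from the Diffie-Hellman ratchet, which this sketch leaves out.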

E2EE Complicates Everything

Requirements Prioritization: The MVP

MVP Scope (Must Have)

•User registration with phone verification
•1:1 text messaging with delivery confirmation
•Real-time message push to online users
•Offline message storage and delivery on reconnect
•Basic message ordering within conversations
•Group creation and messaging (up to 256 members)
•User presence (online/last seen)
•Read receipts (optional per user)
•End-to-end encryption for all messages
•Basic contact discovery

Phase 2 (Important, Deferrable)

•Image and video sharing
•Voice notes
•Large group support (1000+ members)
•Multi-device support
•Message editing and deletion
•Message search
•Disappearing messages
•Status/Stories feature
•Voice and video calling
•Business/API integrations

Interview Prioritization Strategy

For a 45-minute system design interview, focus on:

Minutes 1-5: Clarify requirements, establish MVP scope
Minutes 5-10: Estimate scale, derive system parameters
Minutes 10-30: Design core architecture (1:1 messaging, delivery, storage)
Minutes 30-40: Deep dive on one complex area (likely group messaging or E2EE)
Minutes 40-45: Touch on additional concerns (monitoring, security, scaling)

Explicitly state what you're NOT designing in detail. A clear scope demonstrates senior-level judgment.

The Power of Saying 'No'

Summary and Key Takeaways

Requirements analysis for a messaging system is far more complex than it first appears. We've established the foundation for all subsequent architectural decisions.

Key Takeaways

•Messaging is fundamentally different from request-response systems—it requires real-time bidirectional communication, perfect reliability, and graceful offline handling.
•Scale drives architecture — 100 billion daily messages, 2 billion users, and 300 million concurrent connections demand extreme optimization at every layer.
•Group messaging introduces O(n) complexity — A single message to a 1000-member group requires 999 deliveries, fundamentally changing delivery architecture.
•Latency requirements are stringent — <300ms delivery for real-time feel, with the first 150ms often consumed by network RTT alone.
•Reliability must be effectively 100% — Losing even 0.001% of messages means 1 million lost daily. 'Exactly-once' is achieved through idempotency.
•E2EE is non-negotiable — The Signal Protocol provides confidentiality, forward secrecy, and deniability, but complicates nearly every feature.
•MVP scoping is essential — In interviews and reality, explicitly define what you will and won't design to demonstrate senior judgment.

What's next:
