When you send a message on Discord, it appears for all channel members within 200 milliseconds—often less than 100ms. For perspective, that's faster than a human blink (300-400ms). This seemingly magical instant delivery requires one of the most sophisticated real-time infrastructures ever built.
The challenge isn't sending one message quickly—that's trivial. The challenge is maintaining 10+ million persistent connections simultaneously, any of which might need to receive a message at any moment, while handling 140,000 messages per second at peak, with global geographic distribution, and with zero message loss.
This page takes you deep into Discord's real-time messaging architecture. You'll understand WebSocket connection management at scale, the Gateway service design, message routing and fanout strategies, presence propagation, and the critical role of connection state in distributed systems.
Real-time communication requires bidirectional, low-latency, persistent connections. HTTP's request-response model fails here—polling introduces latency and waste; long-polling has connection limits. WebSockets solve this problem.
What WebSocket provides:

- A single persistent TCP connection per client
- Full-duplex communication: either side can send at any time
- Low per-message overhead (small frames instead of repeated HTTP headers)
- Server-initiated push without client polling
```http
GET /gateway HTTP/1.1
Host: gateway.discord.gg
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13
Authorization: Bearer <user_token>

# Server responds with 101 Switching Protocols
# Connection is now a WebSocket
# All subsequent communication uses WebSocket frames
```

The heartbeat protocol:
To detect dead connections (user closed laptop, network went down without TCP FIN), Discord implements a heartbeat mechanism:
1. The server sends HELLO with a heartbeat_interval (typically ~41 seconds)
2. The client must send HEARTBEAT (opcode 1) within that interval
3. The server replies with HEARTBEAT_ACK (opcode 11)

This bidirectional heartbeat ensures both parties detect connection death within ~45 seconds, crucial for accurate presence tracking.
The heartbeat interval is tuned to balance connection health detection against network overhead. Too short: excessive traffic for millions of connections. Too long: stale presence data. 41.25 seconds with jitter prevents thundering herd problems where all clients heartbeat simultaneously.
The Gateway is Discord's most critical service—the edge layer that maintains all WebSocket connections. Every connected client talks to a Gateway server, which then routes messages to/from backend services.
Gateway responsibilities:

- Accept and authenticate WebSocket connections
- Enforce the heartbeat protocol and detect dead connections
- Subscribe to pub/sub topics on behalf of connected users
- Fan events out to the correct local connections
- Buffer recent events per session to support resume
Connection-to-session mapping:
Each Gateway server maintains an in-memory map of:

- Connection → user ID and session ID
- User ID → that user's local connections
- Topic (guild or channel) → set of subscribed local connections
This enables efficient routing: when a message is sent to channel X, the Gateway knows immediately which local connections need to receive it.
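A sketch of that routing state, with hypothetical type and method names (this is not Discord's actual code), might look like:

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative types for the Gateway's in-memory routing state.
type Conn struct {
	UserID    string
	SessionID string
}

type Registry struct {
	mu      sync.RWMutex
	byUser  map[string][]*Conn        // user ID -> that user's local connections
	byTopic map[string]map[*Conn]bool // "channel:{id}" -> local subscribers
}

func NewRegistry() *Registry {
	return &Registry{
		byUser:  make(map[string][]*Conn),
		byTopic: make(map[string]map[*Conn]bool),
	}
}

// Register indexes a new connection under its user so events addressed to
// a specific user can be delivered without scanning all connections.
func (r *Registry) Register(c *Conn) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.byUser[c.UserID] = append(r.byUser[c.UserID], c)
}

// Subscribe records that a connection wants events published to a topic.
func (r *Registry) Subscribe(c *Conn, topic string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.byTopic[topic] == nil {
		r.byTopic[topic] = make(map[*Conn]bool)
	}
	r.byTopic[topic][c] = true
}

// Subscribers answers "which local connections need this event?" with a
// single map lookup, no backend round trip required.
func (r *Registry) Subscribers(topic string) []*Conn {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]*Conn, 0, len(r.byTopic[topic]))
	for c := range r.byTopic[topic] {
		out = append(out, c)
	}
	return out
}

func main() {
	r := NewRegistry()
	c := &Conn{UserID: "42", SessionID: "abc"}
	r.Register(c)
	r.Subscribe(c, "channel:123")
	fmt.Println(len(r.Subscribers("channel:123")))
}
```

The read-heavy access pattern (every delivered event does a lookup, subscriptions change only on connect/disconnect) is why a RWMutex-guarded map is a reasonable fit here.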
| Opcode | Name | Direction | Description |
|---|---|---|---|
| 0 | Dispatch | Server → Client | Event dispatched to client (MESSAGE_CREATE, etc.) |
| 1 | Heartbeat | Both | Keepalive ping |
| 2 | Identify | Client → Server | Initial authentication |
| 3 | Presence Update | Client → Server | Update online status |
| 4 | Voice State Update | Client → Server | Join/leave voice channel |
| 6 | Resume | Client → Server | Reconnect and replay missed events |
| 7 | Reconnect | Server → Client | Server requests client reconnect |
| 9 | Invalid Session | Server → Client | Session is invalid, re-identify |
| 10 | Hello | Server → Client | Initial handshake with heartbeat interval |
| 11 | Heartbeat ACK | Server → Client | Heartbeat acknowledged |
Each Gateway server can handle approximately 100,000-150,000 concurrent WebSocket connections—limited by memory, file descriptors, and CPU for serialization. At 10 million concurrent users, you need 70-100 Gateway servers.
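As a quick back-of-envelope check on that fleet size, using the per-server capacity range quoted above (a rough sanity check, not Discord's actual capacity planning):

```go
package main

import "fmt"

func main() {
	const concurrentUsers = 10_000_000
	// Assumed per-server connection capacity range from the text.
	const low, high = 100_000, 150_000

	// Dividing total load by per-server capacity bounds the fleet size.
	fmt.Printf("need %d to %d Gateway servers\n",
		concurrentUsers/high, concurrentUsers/low)
	// prints "need 66 to 100 Gateway servers"
}
```

Real deployments add headroom on top of this bound for failover and uneven load, which is how "66 to 100" becomes "70-100" in practice.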
But here's the challenge: when a message is sent to channel X, how do you know which Gateway servers have users watching that channel?
The subscription model:
When a user's WebSocket connects and identifies:
- The Gateway subscribes to the guild:{guild_id} topic for each of the user's guilds
- It also subscribes to the channel:{channel_id} topics for the channels that user can see

Now when a message is sent:
- The Message Service publishes the event to the channel:{channel_id} topic
- Only Gateway servers with at least one subscriber for that topic receive it, and each fans the event out to its local connections
```go
// When user identifies successfully
func (g *Gateway) onIdentify(conn *Connection, user *User) {
	// Get all guilds this user belongs to
	guilds, _ := g.guildService.GetUserGuilds(user.ID)

	for _, guild := range guilds {
		// Subscribe to guild-wide events (member updates, etc.)
		g.pubsub.Subscribe(fmt.Sprintf("guild:%s", guild.ID))

		// Get channels visible to this user in this guild
		channels := g.getVisibleChannels(guild.ID, user.ID)
		for _, channel := range channels {
			// Subscribe to channel-specific events (messages, typing)
			g.pubsub.Subscribe(fmt.Sprintf("channel:%s", channel.ID))
		}
	}

	// Track locally which user is on which connection
	g.userConnections[user.ID] = conn
}

// When receiving pub/sub message
func (g *Gateway) onPubSubMessage(topic string, payload []byte) {
	// Find all local connections interested in this topic
	connections := g.subscriptionMap.GetConnections(topic)

	for _, conn := range connections {
		// Apply per-user filtering (permissions might differ)
		if g.canUserSeeEvent(conn.UserID, payload) {
			conn.Send(payload)
		}
	}
}
```

What about guilds with 500K members? If a message is sent, and 100K of those members are online across 70 Gateways, every Gateway receives it. This is the 'hot channel' problem—solved by special handling for large guilds, explored in the scaling section.
Let's trace exactly what happens when you send a message in Discord, from keypress to delivery on recipients' screens.
```text
1. CLIENT: User presses Enter
   └─ Sends HTTP POST to /api/channels/{id}/messages
   └─ Body: { "content": "Hello world!" }

2. API GATEWAY: Receives request
   └─ Validates authentication token
   └─ Rate limit check (5 msgs/5 sec per channel)
   └─ Routes to Message Service

3. MESSAGE SERVICE: Processes message
   └─ Validate user has SEND_MESSAGES permission in channel
   └─ Apply content filtering (banned words, spam detection)
   └─ Generate unique snowflake ID (timestamp-embedded)
   └─ Transaction: Write to primary database
   └─ Update channel's last_message_id

4. MESSAGE SERVICE: Trigger fanout
   └─ Publish MESSAGE_CREATE event to pub/sub
   └─ Topic: channel:{channel_id}
   └─ Payload includes full message object

5. PUB/SUB: Distributes to subscribed Gateways
   └─ ~10-20 Gateway servers typically have subscribers
   └─ Parallel delivery to all

6. GATEWAY SERVERS (each): Process event
   └─ Look up local connections subscribed to this channel
   └─ For each connection:
      └─ Check user can still see channel (permissions)
      └─ Serialize event to wire format (JSON, ETF)
      └─ Write to WebSocket connection

7. CLIENT: Receives MESSAGE_CREATE
   └─ Deserialize payload
   └─ Insert into local message cache
   └─ Render in UI
   └─ Update unread indicators

TOTAL LATENCY: 50-150ms typical
- Network Client→API: 10-40ms
- API processing: 5-15ms
- Database write: 5-20ms
- Pub/sub fanout: 5-15ms
- Gateway→Client: 10-40ms
```

Messages are SENT via HTTP (reliable, can return errors) but RECEIVED via WebSocket (lowest latency). This hybrid approach gives the best of both worlds: reliable writes with real-time reads.
Typing indicators—optimized for latency:
Typing indicators follow a different path optimized purely for latency:
- The client sends a TYPING_START op via WebSocket (not HTTP)

Typing indicators are ephemeral—they're never persisted, so they bypass the database entirely, achieving 30-50ms delivery.
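Assuming typing events skip persistence entirely, the fanout path reduces to "marshal and publish." A self-contained sketch with an in-process stand-in for the pub/sub bus (the PubSub type and function names are hypothetical):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PubSub is an in-process stand-in for the real pub/sub layer.
type PubSub struct{ subs map[string][]func([]byte) }

func (p *PubSub) Subscribe(topic string, fn func([]byte)) {
	p.subs[topic] = append(p.subs[topic], fn)
}

func (p *PubSub) Publish(topic string, payload []byte) {
	for _, fn := range p.subs[topic] {
		fn(payload)
	}
}

// onTypingStart fans the event out immediately with no database write,
// which is why typing indicators can beat message delivery latency.
func onTypingStart(p *PubSub, userID, channelID string) {
	payload, _ := json.Marshal(map[string]string{
		"t": "TYPING_START", "user_id": userID, "channel_id": channelID,
	})
	p.Publish("channel:"+channelID, payload)
}

func main() {
	p := &PubSub{subs: map[string][]func([]byte){}}
	p.Subscribe("channel:99", func(b []byte) { fmt.Println(string(b)) })
	onTypingStart(p, "42", "99")
}
```

Contrast this with the message path above: no permission transaction, no snowflake, no primary-database write sits between send and delivery.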
Presence—knowing who's online, idle, DND, or offline—seems simple but becomes incredibly complex at scale. Discord must track and propagate presence for millions of users, updating in real-time as users come online, go idle, or disconnect.
The presence challenge:
| Status | Meaning | Determination |
|---|---|---|
| Online | User is active | Heartbeat received, recent activity |
| Idle | User is away | 5+ minutes since last activity |
| Do Not Disturb | User set DND | Explicit user action |
| Invisible | Appears offline to others | User preference (stored) |
| Offline | Not connected | No active gateway connection |
| Streaming | User is streaming | Detected stream activity |
Presence propagation strategy:
Discord doesn't broadcast every presence change to everyone. Instead:
```jsonc
// PRESENCE_UPDATE event
{
  "op": 0,
  "t": "PRESENCE_UPDATE",
  "d": {
    "user": { "id": "123456789012345678" },
    "status": "online",      // online, idle, dnd, offline
    "activities": [
      {
        "name": "Visual Studio Code",
        "type": 0,           // 0=Game, 1=Streaming, 2=Listening, etc.
        "state": "Editing page-2.ts",
        "timestamps": { "start": 1704729600000 }
      }
    ],
    "client_status": {
      "desktop": "online",
      "mobile": "idle",
      "web": null
    },
    "guild_id": "987654321098765432"  // Context for this update
  }
}
```

Presence is designed for eventual consistency. If it takes 10 seconds for someone's status to update, that's acceptable. This relaxed consistency requirement allows significant optimization—presence updates can be batched, delayed, and even dropped during overload.
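One concrete way the relaxed consistency pays off is coalescing: between flushes, only the newest status per user survives, and intermediate transitions are simply dropped. A hypothetical batcher (names are illustrative):

```go
package main

import "fmt"

// Coalescer keeps only the newest unflushed status per user. Under load,
// a user who went online -> idle -> online before a flush generates one
// outgoing update, not three.
type Coalescer struct{ pending map[string]string }

func NewCoalescer() *Coalescer {
	return &Coalescer{pending: make(map[string]string)}
}

// Update overwrites any unflushed older status for the same user.
func (c *Coalescer) Update(userID, status string) {
	c.pending[userID] = status
}

// Flush hands the batch to the fanout layer and resets the buffer;
// in a real system a timer would call this every few seconds.
func (c *Coalescer) Flush() map[string]string {
	out := c.pending
	c.pending = make(map[string]string)
	return out
}

func main() {
	c := NewCoalescer()
	c.Update("42", "online")
	c.Update("42", "idle") // supersedes "online" before any flush
	fmt.Println(c.Flush()) // one update per user reaches subscribers
}
```

This trade is only safe because presence is ephemeral state: the latest value is all anyone needs, unlike messages, where every event must be delivered.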
Network connections fail constantly—WiFi handoffs, cellular dead zones, laptop sleep/wake cycles. Discord must handle disconnections gracefully without losing messages or requiring full state resync.
The Resume protocol:
When a client connects, it receives a session_id and tracks a sequence number for each event received. If the connection drops:
- The client reconnects and sends RESUME with its session_id and the last sequence number it received
```jsonc
// Client sends RESUME
{
  "op": 6,
  "d": {
    "token": "user_auth_token",
    "session_id": "abc123def456",
    "seq": 42            // Last event sequence received
  }
}

// If session still valid, server sends:
{
  "op": 0,
  "t": "RESUMED",
  "s": 42,               // Confirms sequence
  "d": {}
}

// Then replays missed events:
{ "op": 0, "t": "MESSAGE_CREATE", "s": 43, "d": {...} }
{ "op": 0, "t": "MESSAGE_CREATE", "s": 44, "d": {...} }
{ "op": 0, "t": "TYPING_START", "s": 45, "d": {...} }

// If session expired (>15-30 seconds):
{
  "op": 9,               // Invalid Session
  "d": false             // Cannot resume, must re-identify
}
```

How does the Gateway remember events to replay?
Each Gateway maintains a per-session event buffer:

- The most recent events dispatched on that session (on the order of 1,000 entries), keyed by sequence number
- A short expiry after disconnect, in line with the 15-30 second resume window
If the buffer has been exhausted (client was disconnected too long), the session is invalid and client must re-identify, receiving full state (which is expensive but necessary).
With 100K connections per Gateway and 1000 events per buffer, that's potentially 100M buffered events per Gateway. At ~500 bytes average, that's ~50GB per Gateway just for resume buffers. This is why buffers have strict limits and expire quickly.
The client maintains a local cache of Discord state—messages, channels, members, settings. This cache must stay synchronized with the server through the WebSocket connection.
Initial state load (READY event):
When a client identifies, the server sends a massive READY event containing:

- The user object (ID, username, avatar)
- The user's guild list
- Private channels (DMs and group DMs)
- A session_id and resume_gateway_url for later reconnection
For guilds, this is intentionally partial—full member lists for large guilds would be megabytes.
```jsonc
{
  "op": 0,
  "t": "READY",
  "s": 1,
  "d": {
    "v": 10,             // Gateway protocol version
    "user": {
      "id": "123456789012345678",
      "username": "user",
      "discriminator": "0",
      "avatar": "a_abc123"
    },
    "guilds": [
      // "Unavailable" guilds - just IDs, details sent via GUILD_CREATE
      { "id": "111", "unavailable": true },
      { "id": "222", "unavailable": true }
    ],
    "private_channels": [
      { "id": "333", "type": 1, "recipients": [...] }
    ],
    "session_id": "abc123def456",
    "resume_gateway_url": "wss://gateway-us-east1-b.discord.gg",
    "shard": [0, 1],     // For bots: shard ID, total shards
    "application": { "id": "...", "flags": ... }
  }
}
```

Lazy guild loading:
To keep READY fast, guilds are marked unavailable and detailed data arrives via separate GUILD_CREATE events. The client shows a loading state until each guild's data arrives.
Request Guild Members (lazy load):
When you click on a guild's member list:
1. The client sends a REQUEST_GUILD_MEMBERS op over the WebSocket
2. The Gateway streams the member list back in GUILD_MEMBERS_CHUNK events

This lazy loading is essential—a user in 100 guilds with 1000 members each would otherwise need to load 100K member objects on startup.
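The chunking side of this is straightforward to sketch. Assuming a 1000-member chunk size (the ceiling Discord's gateway docs give for GUILD_MEMBERS_CHUNK), a minimal illustrative helper:

```go
package main

import "fmt"

// chunkMembers splits a member list into fixed-size batches so no single
// WebSocket frame carries an unbounded payload.
func chunkMembers(members []string, size int) [][]string {
	var chunks [][]string
	for start := 0; start < len(members); start += size {
		end := start + size
		if end > len(members) {
			end = len(members)
		}
		chunks = append(chunks, members[start:end])
	}
	return chunks
}

func main() {
	members := make([]string, 2500)
	for i := range members {
		members[i] = fmt.Sprintf("user-%d", i)
	}
	chunks := chunkMembers(members, 1000)
	fmt.Println(len(chunks), "chunks") // 1000 + 1000 + 500
}
```

Streaming chunks also lets the client render the first page of the member list before the rest arrives.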
When you send a message, the client immediately displays it locally (optimistic update) before server confirmation. If the send fails, the message shows a 'failed to send' indicator. This makes the UI feel instant even with network latency.
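A minimal sketch of the optimistic-send bookkeeping, assuming the client tags each message with a nonce that the server echoes back in MESSAGE_CREATE (Discord's message API does support a nonce field; the types and states here are illustrative):

```go
package main

import "fmt"

// SendState tracks an optimistic message through its lifecycle.
type SendState int

const (
	Pending SendState = iota // rendered immediately, awaiting server ack
	Sent                     // server echoed the nonce; message is durable
	Failed                   // POST failed; UI shows "failed to send"
)

type LocalMessage struct {
	Nonce   string // client-generated ID used to match the ack
	Content string
	State   SendState
}

// reconcile matches a server response (or failure) to the optimistic
// local copy by nonce and finalizes its state.
func reconcile(pending map[string]*LocalMessage, nonce string, ok bool) {
	m, found := pending[nonce]
	if !found {
		return
	}
	if ok {
		m.State = Sent
	} else {
		m.State = Failed
	}
	delete(pending, nonce)
}

func main() {
	pending := map[string]*LocalMessage{}
	msg := &LocalMessage{Nonce: "n1", Content: "Hello world!", State: Pending}
	pending[msg.Nonce] = msg // the UI renders the message right here
	reconcile(pending, "n1", true)
	fmt.Println(msg.State == Sent)
}
```

The nonce also deduplicates: when the sender's own MESSAGE_CREATE arrives over the WebSocket, the client can match it to the pending copy instead of rendering the message twice.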
We've explored the sophisticated infrastructure that powers Discord's real-time communication. Let's consolidate the key insights:

- Persistent WebSockets with a jittered heartbeat protocol keep millions of clients reachable and detect dead connections within ~45 seconds
- The Gateway edge layer maps connections to sessions and uses pub/sub subscriptions to route events only to servers that need them
- Messages are sent over HTTP (reliable, explicit errors) but delivered over WebSocket (lowest latency)
- Presence tolerates eventual consistency, so updates can be batched, delayed, or dropped under load
- The Resume protocol replays missed events from bounded per-session buffers, avoiding expensive full state resyncs
What's next:
With the real-time text messaging foundation understood, we'll next explore Discord's server architecture—how backend services are organized, how data is stored and sharded, and how the API layer handles 100,000+ requests per second.
You now understand how Discord achieves real-time message delivery to millions of concurrent users. You've learned WebSocket lifecycle management, Gateway architecture, pub/sub event routing, presence propagation, and session resume handling. These patterns apply to any real-time communication system.