
Stateless vs Stateful Services

Level: Intermediate

Duration: 75 mins

Topic: Stateless vs Stateful Services


Stateful — Session State Maintained

When State Must Live on the Server

In the previous page, we explored the elegance and power of stateless architecture. Statelessness is often the default recommendation for scalable systems—and rightfully so. But architecture is about trade-offs, not absolutes.

Some systems genuinely require stateful services. Not as a compromise, not as technical debt, but as the correct architectural choice given their requirements. Understanding when and how to embrace statefulness is just as important as understanding statelessness.

This page takes you deep into stateful architecture: its definition, legitimate use cases, implementation patterns, operational implications, and the design decisions that make stateful systems manageable at scale.

What You Will Learn

By the end of this page, you will understand stateful architecture thoroughly: why it exists, when it's necessary, how to implement it correctly, and how leading companies run stateful systems at scale. You'll be equipped to make informed choices about when state should live on your servers.

Defining Statefulness with Precision

Statefulness is an architectural property where servers maintain information about client sessions or interactions across multiple requests. The server "remembers" previous interactions, and this memory affects how subsequent requests are processed.

Let's formalize this definition:

Stateful Service Definition: A service is stateful if it maintains client-specific or session-specific data in local memory or storage, and this data influences the processing of subsequent requests from that client.

This definition has critical implications:

The Three Characteristics of Statefulness

•Session Affinity Required — Clients must return to the same server instance to access their state. This creates 'sticky' relationships between clients and specific servers.
•Local State Storage — The server holds state in its own memory or local storage, not (only) in external systems. This state is not automatically available to other instances.
•Sequential Request Dependencies — The processing of request N may depend on what happened in requests 1 through N-1. Order and history matter.

Critical Distinction

Stateful services are NOT simply services that interact with databases. A service that reads user data from PostgreSQL is not stateful—it's stateless with external storage. Statefulness specifically means the server instance itself maintains state that would be lost if that instance disappeared.

The State Location Spectrum:

Understanding statefulness requires clarity about where state lives:

1. Client-side state: stored in the browser or app and sent with each request (cookies, tokens). Server is stateless.
2. External shared state: stored in databases or Redis, accessible by all instances. Server is stateless.
3. Server-local state: stored in the specific server instance's memory or local storage. This makes the service stateful.
4. Hybrid state: some state local, some external. Partially stateful.

True statefulness involves option 3 or 4—state that is bound to a specific server instance.
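
The difference between external shared state and server-local state can be made concrete with a small sketch. `SharedStore` below is a hypothetical in-process stand-in for Redis or a database, and both counter classes are illustrative names rather than part of any library. Two instances backed by the shared store agree on a client's count; two instances that keep the count locally do not.

```typescript
// Hypothetical stand-in for an external store such as Redis or a database.
class SharedStore {
  private data = new Map<string, number>();
  get(key: string): number {
    return this.data.get(key) ?? 0;
  }
  set(key: string, value: number): void {
    this.data.set(key, value);
  }
}

// Stateless instance: all state is read from and written to the shared store.
class StatelessCounter {
  private store: SharedStore;
  constructor(store: SharedStore) {
    this.store = store;
  }
  hit(clientId: string): number {
    const next = this.store.get(clientId) + 1;
    this.store.set(clientId, next);
    return next;
  }
}

// Stateful instance: the count lives only in this instance's memory.
class StatefulCounter {
  private counts = new Map<string, number>();
  hit(clientId: string): number {
    const next = (this.counts.get(clientId) ?? 0) + 1;
    this.counts.set(clientId, next);
    return next;
  }
}

// Two stateless instances share one store, so either can serve the client.
const sharedStore = new SharedStore();
const statelessA = new StatelessCounter(sharedStore);
const statelessB = new StatelessCounter(sharedStore);
statelessA.hit('alice');
const statelessSecondHit = statelessB.hit('alice'); // sees the first hit

// Two stateful instances each start from scratch for the same client.
const statefulA = new StatefulCounter();
const statefulB = new StatefulCounter();
statefulA.hit('alice');
const statefulSecondHit = statefulB.hit('alice'); // does not see the first hit
```

In the stateful case the second request only gets the right answer if it is routed back to the instance that handled the first one, which is exactly the session affinity requirement.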

Why Statefulness Exists: Legitimate Use Cases

If statelessness is so advantageous, why would anyone choose statefulness? Because certain requirements are fundamentally better served—or only achievable—with stateful architectures.

Use Case 1: Real-Time Bidirectional Communication

WebSocket connections and gRPC streams are long-lived and carry client-specific state on the server that holds them. Server-sent events behave the same way, although they flow in one direction only (server to client):

websocket-stateful.ts
TypeScript
// WebSocket connections are inherently stateful
class ChatServer {
  // State: Map of active connections per user
  // This state lives in THIS server instance's memory
  private connections = new Map<string, WebSocket[]>();
  
  // State: Per-connection metadata
  private connectionMeta = new WeakMap<WebSocket, ConnectionMetadata>();
  
  handleConnection(ws: WebSocket, userId: string) {
    // Store connection state locally
    if (!this.connections.has(userId)) {
      this.connections.set(userId, []);
    }
    this.connections.get(userId)!.push(ws);
    
    // Store metadata about this specific connection
    this.connectionMeta.set(ws, {
      userId,
      connectedAt: Date.now(),
      lastHeartbeat: Date.now(),
      messageCount: 0  // Accumulating state across messages
    });
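
    // Drop the connection from local state when it closes; otherwise the
    // map keeps dead sockets and grows without bound
    ws.on('close', () => {
      const remaining = (this.connections.get(userId) ?? []).filter(c => c !== ws);
      if (remaining.length > 0) {
        this.connections.set(userId, remaining);
      } else {
        this.connections.delete(userId);
      }
    });
  }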
  
  sendToUser(userId: string, message: Message) {
    // Can only send to users connected to THIS server
    // This is the fundamental limitation of stateful services
    const userConnections = this.connections.get(userId);
    
    if (userConnections) {
      userConnections.forEach(ws => ws.send(JSON.stringify(message)));
    }
    // If user is connected to a DIFFERENT server, we can't reach them
    // This requires additional coordination (pub/sub, etc.)
  }
}

Use Case 2: In-Memory Computation and Caching

Some systems require holding large datasets or computation state in memory for performance:

Memory-Intensive Stateful Systems

•Game servers — Hold entire game world state, player positions, physics simulation in memory. Rebuilding from database each frame is impossible.
•Real-time analytics — Maintain sliding windows, aggregations, and streaming computations. State is the computation itself.
•Machine learning inference — Large models loaded into memory. Model state includes cached computations, batch buffers.
•Collaborative editing — Document state, cursor positions, conflict resolution all maintained in server memory for sub-100ms latency.
•Financial trading systems — Order books, position tracking, risk calculations updated on every tick. State is the system.

Use Case 3: Workflow and Saga Orchestration

Long-running workflows often maintain state about multi-step processes:

workflow-orchestrator.ts
TypeScript
// Workflow orchestrator maintaining execution state
class OrderFulfillmentOrchestrator {
  // State: Active workflow executions
  private activeWorkflows = new Map<string, WorkflowState>();
  
  async executeWorkflow(orderId: string) {
    // Create and track workflow state
    const state: WorkflowState = {
      orderId,
      step: 'payment',
      startedAt: Date.now(),
      compensations: [],  // For rollback if needed
      retryCount: 0
    };
    this.activeWorkflows.set(orderId, state);
    
    try {
      // Step 1: Payment
      state.step = 'payment';
      const paymentResult = await this.processPayment(orderId);
      state.compensations.push(() => this.refundPayment(paymentResult.id));
      
      // Step 2: Inventory
      state.step = 'inventory';
      const reservationResult = await this.reserveInventory(orderId);
      state.compensations.push(() => this.releaseInventory(reservationResult.id));
      
      // Step 3: Shipping
      state.step = 'shipping';
      await this.initiateShipping(orderId);
      
      // Success
      state.step = 'completed';
      
    } catch (error) {
      // Saga compensation - rollback in reverse order
      // Relies on state accumulated during execution
      // (copy before reversing: Array.prototype.reverse mutates in place)
      for (const compensation of [...state.compensations].reverse()) {
        await compensation();
      }
      state.step = 'rolled_back';
    }
  }
  
  getWorkflowStatus(orderId: string): WorkflowState | undefined {
    // Can only check workflows running on THIS server
    return this.activeWorkflows.get(orderId);
  }
}

The Common Thread

All legitimate statefulness use cases share a pattern: the cost of not having local state (latency, complexity, or impossibility) exceeds the cost of managing local state (scaling complexity, failure handling). When externalizing state would make the system unusable or impossible, statefulness is the correct choice.

Types of Server State

Not all server state is equal. Understanding the different types of state helps you make better decisions about what state to keep locally and how to manage it.

1. Connection State

The most common form of server state in modern systems. Each active connection has associated state:

Connection State Examples
Protocol	Connection State Includes	Lifetime
WebSocket	Socket handle, authentication context, subscriptions, pending messages	Session duration (minutes to hours)
HTTP/2	Stream multiplexing state, flow control windows, header compression context	Connection duration (seconds to minutes)
gRPC Streaming	Stream state, buffered messages, cancellation tokens	Stream duration (variable)
Database Connection	Transaction state, prepared statements, session variables	Request or connection pool lifetime

2. Session State

Information about an ongoing user session that spans multiple requests:

Session State Examples

•Authentication tokens and refresh state — Token validation caches, session expiry tracking
•User preferences loaded into memory — Language, theme, feature flags for fast access
•Shopping cart in e-commerce — Items, quantities, applied discounts (if not externalized)
•Form wizard progress — Multi-step form data accumulated across requests
•Rate limiting counters — Request counts per client for throttling (when not Redis-backed)
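
The rate limiting case in the list above shows clearly what "session state on one instance" means in practice. The sketch below is a minimal fixed-window limiter under stated assumptions: `LocalRateLimiter` is a hypothetical name, the limit and window are arbitrary, and the current time is passed in explicitly so the behaviour is deterministic.

```typescript
// Fixed-window rate limiter whose counters live only in this instance's memory.
class LocalRateLimiter {
  private windows = new Map<string, { windowStart: number; count: number }>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `now` is a timestamp in milliseconds supplied by the caller.
  allow(clientId: string, now: number): boolean {
    const current = this.windows.get(clientId);
    if (current === undefined || now - current.windowStart >= this.windowMs) {
      // First request from this client, or the previous window has expired
      this.windows.set(clientId, { windowStart: now, count: 1 });
      return true;
    }
    if (current.count < this.limit) {
      current.count += 1;
      return true;
    }
    return false;
  }
}

const limiterOnServerA = new LocalRateLimiter(2, 1000);
const limiterOnServerB = new LocalRateLimiter(2, 1000);

const decisions = [
  limiterOnServerA.allow('client-1', 0),    // first request in the window
  limiterOnServerA.allow('client-1', 10),   // second request, limit reached
  limiterOnServerA.allow('client-1', 20),   // rejected on server A
  limiterOnServerB.allow('client-1', 30),   // server B has never seen this client
  limiterOnServerA.allow('client-1', 1000), // a new window has started on A
];
```

The fourth decision is the interesting one: without affinity (or a shared counter), a client that is throttled on one instance is simply allowed through on another.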

3. Computation State

State that represents ongoing or cached computations:

computation-state.ts
TypeScript
// Examples of computation state in stateful services
 
class StreamingAggregator {
  // State: Sliding window for real-time metrics
  private windows = new Map<string, SlidingWindow>();
  
  // This state is the computation - it cannot be externalized
  // without destroying the real-time property
  recordEvent(metricId: string, value: number, timestamp: number) {
    const window = this.windows.get(metricId) ?? new SlidingWindow(60_000);
    window.add(value, timestamp);
    this.windows.set(metricId, window);
  }
  
  getAverage(metricId: string): number {
    return this.windows.get(metricId)?.average() ?? 0;
  }
}
 
class GamePhysicsEngine {
  // State: Entity positions, velocities, collision state
  private entities = new Map<string, PhysicsEntity>();
  
  // State: Spatial partitioning structure for collision detection
  private spatialHash = new SpatialHashGrid(100);
  
  // This state represents the simulation - must be in memory
  tick(deltaTime: number) {
    for (const entity of this.entities.values()) {
      // Update positions
      entity.position.x += entity.velocity.x * deltaTime;
      entity.position.y += entity.velocity.y * deltaTime;
      
      // Update spatial hash
      this.spatialHash.update(entity.id, entity.position);
    }
    
    // Collision detection using spatial hash
    this.resolveCollisions();
  }
}
 
class MLModelServer {
  // State: Large neural network weights loaded in memory
  private model!: NeuralNetwork;  // assigned in initialize(), hence the definite-assignment '!'
  
  // State: Batch buffer for efficient inference
  private batchBuffer: InferenceRequest[] = [];
  private batchTimer: NodeJS.Timeout | null = null;
  
  // Loading model from disk takes seconds - must stay in memory
  async initialize() {
    this.model = await loadModel('models/recommendation-v3.onnx');
    // 500MB+ model now in server memory
  }
  
  async infer(input: Tensor): Promise<Tensor> {
    return this.model.forward(input);
  }
}
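
`StreamingAggregator` above uses a `SlidingWindow` type that the listing does not define. One possible minimal version is sketched below. It is an assumption about the intended behaviour (a time-based window over samples that arrive in non-decreasing timestamp order), not the original implementation.

```typescript
// Time-based sliding window: keeps only samples newer than `windowMs`
// relative to the most recently added timestamp.
class SlidingWindow {
  private samples: { value: number; timestamp: number }[] = [];
  private windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  add(value: number, timestamp: number): void {
    this.samples.push({ value, timestamp });
    this.evictOlderThan(timestamp - this.windowMs);
  }

  average(): number {
    if (this.samples.length === 0) return 0;
    let sum = 0;
    for (const sample of this.samples) sum += sample.value;
    return sum / this.samples.length;
  }

  // Samples are stored oldest-first, so eviction only looks at the front.
  private evictOlderThan(cutoff: number): void {
    while (this.samples.length > 0 && this.samples[0].timestamp <= cutoff) {
      this.samples.shift();
    }
  }
}

const latencyWindow = new SlidingWindow(60_000);
latencyWindow.add(10, 0);
latencyWindow.add(20, 30_000);
latencyWindow.add(30, 61_000); // the sample at t=0 is now outside the window
const windowAverage = latencyWindow.average(); // average of 20 and 30
```

Everything the window knows is in the `samples` array. If the process restarts, the aggregate starts again from empty, which is why this kind of state is usually paired with checkpointing or replay from a log.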

4. Actor/Entity State

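In actor-model systems (Akka, Orleans, Dapr), each actor encapsulates its own state and processes messages one at a time, so that state is never shared directly with other actors.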

Actor State Locality

Actor models (Microsoft Orleans, Akka) embrace statefulness deliberately. Each actor holds its state in memory, and the framework handles actor placement, migration, and persistence. This is 'managed statefulness'—the complexity is handled by infrastructure rather than application code.

Session Affinity: Routing to State

The defining operational characteristic of stateful services is session affinity (also called "sticky sessions"). Clients must be consistently routed to the server instance holding their state.

How Session Affinity Works:

Load balancers implement affinity through various mechanisms:

Session Affinity Mechanisms
Mechanism	How It Works	Pros	Cons
Cookie-based	LB sets/reads cookie with server ID	Works through NAT, browser handles persistence	Cookie size limits, requires cookie support
IP Hash	Hash of client IP determines server	No cookies needed, works for any protocol	Breaks with NAT/proxy, mobile IP changes
Header-based	Custom header indicates target server	Flexible, controllable by application	Requires application cooperation
Connection-based	Same TCP connection stays on same server	Perfect for WebSocket/streams	New connections may hit different server
Consistent Hashing	Hash of session ID to server ring	Handles server changes gracefully	More complex implementation

nginx-session-affinity.conf
Nginx Config
# NGINX configuration for session affinity
 
# Method 1: IP Hash - Simple but imperfect
upstream backend_ip_hash {
    ip_hash;
    server backend1.example.com:8080;
    server backend2.example.com:8080;
    server backend3.example.com:8080;
}
 
# Method 2: Cookie-based sticky sessions (NGINX Plus)
upstream backend_sticky {
    # Creates/reads cookie named 'srv_id' with 1-hour expiry
    sticky cookie srv_id expires=1h domain=.example.com path=/;
    server backend1.example.com:8080;
    server backend2.example.com:8080;
    server backend3.example.com:8080;
}
 
# Method 3: Consistent hashing on a header/variable
upstream backend_consistent {
    hash $cookie_session_id consistent;
    server backend1.example.com:8080;
    server backend2.example.com:8080;
    server backend3.example.com:8080;
}
 
server {
    listen 80;
    
    location /api/ {
        proxy_pass http://backend_sticky;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        
        # WebSocket support for stateful connections
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
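
Method 3 in the configuration relies on consistent hashing. The idea can be sketched in a few lines: servers are placed at many points on a ring, and a session is owned by the first server point at or after the session's hash. The code below is an illustrative sketch, not how NGINX implements it; `HashRing`, the FNV-1a hash, and the choice of 100 points per server are all assumptions made for the example.

```typescript
// 32-bit FNV-1a hash, used here only because it needs no imports.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

class HashRing {
  private ring: { point: number; server: string }[] = [];
  private pointsPerServer: number;

  constructor(servers: string[], pointsPerServer = 100) {
    this.pointsPerServer = pointsPerServer;
    for (const server of servers) this.add(server);
  }

  add(server: string): void {
    for (let i = 0; i < this.pointsPerServer; i++) {
      this.ring.push({ point: fnv1a(`${server}#${i}`), server });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  remove(server: string): void {
    this.ring = this.ring.filter(entry => entry.server !== server);
  }

  // Binary search for the first ring point >= hash, wrapping to the start.
  lookup(sessionId: string): string {
    if (this.ring.length === 0) throw new Error('no servers in ring');
    const hash = fnv1a(sessionId);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].point < hash) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].server;
  }
}

const hashRing = new HashRing(['backend1', 'backend2', 'backend3']);
const sessionIds = Array.from({ length: 1000 }, (_, i) => `session-${i}`);

const ownersBefore = sessionIds.map(id => hashRing.lookup(id));
hashRing.remove('backend3'); // simulate backend3 leaving the pool
const ownersAfter = sessionIds.map(id => hashRing.lookup(id));

// Sessions that were NOT on backend3 but changed owner anyway
const disturbedSessions = sessionIds.filter(
  (_, i) => ownersBefore[i] !== 'backend3' && ownersBefore[i] !== ownersAfter[i]
);
```

Only sessions that lived on the removed server are reassigned; every other session keeps its owner. With a plain `hash % serverCount` scheme, removing one server would remap most sessions and invalidate most local state.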

The Session Affinity Problem: Server Failure

The critical weakness of session affinity is server failure handling. When a server holding state dies:

State is lost — Unless replicated or persisted, all in-memory state disappears
Clients must re-establish — Sessions, connections, computations must restart
Load imbalance during recovery — Remaining servers absorb displaced load
User experience degradation — Users may see errors, lost progress, or need to re-authenticate

Without Proper Handling

•Server-3 crashes
•All sessions on Server-3 lost immediately
•Users see 500 errors or connection drops
•Load balancer eventually marks server down
•Users retry, hit different server, no state
•Must re-authenticate, restart workflows

With Proper Handling

•Server-3 crashes
•State was replicated to Server-4 (hot replica)
•Health check detects failure in <5s
•Load balancer updates affinity to Server-4
•Clients auto-reconnect to Server-4
•Continued operation with minimal disruption

The Reliability Tax

Every stateful system must answer: 'What happens when a server holding state fails?' The answer—replication, persistence, graceful degradation, or accepting loss—defines your reliability story. This 'reliability tax' is unavoidable with statefulness.

State Replication Strategies

To achieve reliability in stateful systems, state must be replicated or persisted. Several strategies exist, each with different trade-offs.

Strategy 1: Active-Passive Replication

One server (primary) holds active state; others (secondaries) receive replicated copies:

active-passive-replication.ts
TypeScript
// Active-Passive replication pattern
class ReplicatedSessionStore {
  private localState = new Map<string, SessionData>();
  private role: 'primary' | 'secondary';
  private peerNodes: PeerConnection[];
  
  async updateSession(sessionId: string, data: SessionData) {
    if (this.role === 'secondary') {
      throw new Error('Cannot update on secondary node');
    }
    
    // Update local state
    this.localState.set(sessionId, data);
    
    // Replicate to secondaries (sync or async)
    const replicationPromises = this.peerNodes.map(
      peer => peer.replicate(sessionId, data)
    );
    
    // Choice: Wait for replication (consistent but slower)
    // or continue immediately (faster but risk data loss)
    await Promise.all(replicationPromises);  // Sync replication
    
    return data;
  }
  
  async promoteToPrimary() {
    // Called when primary fails and this node takes over
    this.role = 'primary';
    // State is already local from replication
  }
}

Strategy 2: Multi-Primary / Conflict Resolution

All nodes can accept writes; conflicts are resolved via CRDTs or last-write-wins:

multi-primary-crdt.ts
TypeScript
// Multi-primary with CRDT for conflict resolution
class DistributedCounter {
  // CRDT: G-Counter (grow-only counter)
  // Each node tracks its own increments
  private nodeId: string;
  private counters = new Map<string, number>();
  
  constructor(nodeId: string) {
    this.nodeId = nodeId;
    this.counters.set(nodeId, 0);
  }
  
  increment() {
    // Only increment this node's counter
    const current = this.counters.get(this.nodeId) ?? 0;
    this.counters.set(this.nodeId, current + 1);
  }
  
  // Merge function - CRDT magic
  // No conflict possible: always take max
  merge(other: Map<string, number>) {
    for (const [nodeId, count] of other) {
      const myCount = this.counters.get(nodeId) ?? 0;
      this.counters.set(nodeId, Math.max(myCount, count));
    }
  }
  
  // Read total across all nodes
  value(): number {
    let total = 0;
    for (const count of this.counters.values()) {
      total += count;
    }
    return total;
  }
}
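
To see why the merge rule converges, it helps to run two nodes side by side. The sketch below restates the counter in a compact standalone form so that it can run on its own. The `snapshot()` accessor is a hypothetical addition (the class above keeps its map private), standing in for whatever message a node would send to its peers.

```typescript
class GCounter {
  private nodeId: string;
  private counters = new Map<string, number>();

  constructor(nodeId: string) {
    this.nodeId = nodeId;
    this.counters.set(nodeId, 0);
  }

  increment(): void {
    this.counters.set(this.nodeId, (this.counters.get(this.nodeId) ?? 0) + 1);
  }

  // Hypothetical accessor: a copy of the per-node counts to send to peers.
  snapshot(): Map<string, number> {
    return new Map(this.counters);
  }

  // Per-node maximum: safe to apply in any order, any number of times.
  merge(other: Map<string, number>): void {
    other.forEach((count, id) => {
      this.counters.set(id, Math.max(this.counters.get(id) ?? 0, count));
    });
  }

  value(): number {
    let total = 0;
    this.counters.forEach(count => {
      total += count;
    });
    return total;
  }
}

const nodeA = new GCounter('a');
const nodeB = new GCounter('b');

// Both nodes accept writes independently
nodeA.increment();
nodeA.increment();
nodeB.increment();

// Exchange state in both directions; the repeated merge changes nothing
nodeA.merge(nodeB.snapshot());
nodeB.merge(nodeA.snapshot());
nodeB.merge(nodeA.snapshot());

const convergedTotals = [nodeA.value(), nodeB.value()];
```

Because each node only ever increases its own entry and merging takes the per-entry maximum, duplicated or reordered merge messages cannot double-count an increment.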

Strategy 3: Periodic Checkpointing

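State is periodically written to durable storage. After a crash, a replacement instance reloads the latest checkpoint and loses only the changes made since that checkpoint was taken.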
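
A minimal sketch of checkpoint-and-recover is shown below. It makes several simplifying assumptions: `CheckpointStore` is a hypothetical interface backed here by memory (a real system might use a file, object storage, or a database), and a checkpoint is taken every N updates rather than on a timer so the example stays deterministic.

```typescript
// Hypothetical durable storage abstraction.
interface CheckpointStore {
  save(snapshot: string): void;
  load(): string | null;
}

// In-memory stand-in that outlives the service instances in this example.
class InMemoryCheckpointStore implements CheckpointStore {
  private latest: string | null = null;
  save(snapshot: string): void {
    this.latest = snapshot;
  }
  load(): string | null {
    return this.latest;
  }
}

class CheckpointedCounters {
  private counts = new Map<string, number>();
  private updatesSinceCheckpoint = 0;
  private store: CheckpointStore;
  private checkpointEvery: number;

  constructor(store: CheckpointStore, checkpointEvery: number) {
    this.store = store;
    this.checkpointEvery = checkpointEvery;
    // Recovery: start from the latest checkpoint if one exists
    const snapshot = store.load();
    if (snapshot !== null) {
      this.counts = new Map(JSON.parse(snapshot) as [string, number][]);
    }
  }

  record(key: string): void {
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
    this.updatesSinceCheckpoint += 1;
    if (this.updatesSinceCheckpoint >= this.checkpointEvery) this.checkpoint();
  }

  checkpoint(): void {
    this.store.save(JSON.stringify(Array.from(this.counts.entries())));
    this.updatesSinceCheckpoint = 0;
  }

  get(key: string): number {
    return this.counts.get(key) ?? 0;
  }
}

const durableStore = new InMemoryCheckpointStore();
const originalInstance = new CheckpointedCounters(durableStore, 2);
for (let i = 0; i < 5; i++) {
  originalInstance.record('page-views'); // checkpoints after updates 2 and 4
}
const valueBeforeCrash = originalInstance.get('page-views');

// Simulate a crash: a replacement instance starts from the stored checkpoint
const replacementInstance = new CheckpointedCounters(durableStore, 2);
const valueAfterRecovery = replacementInstance.get('page-views');
```

The gap between the two values is the data-loss window. Checkpointing more often shrinks it at the cost of more writes to durable storage.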

State Replication Strategy Comparison
Strategy	Consistency	Availability	Complexity	Best For
Active-Passive (Sync)	Strong	Lower during failover	Medium	Financial, order processing
Active-Passive (Async)	Eventual (risk of loss)	High	Medium	Gaming, sessions
Multi-Primary CRDT	Eventual (convergent)	High	High	Counters, sets, distributed data
Checkpointing	Point-in-time	High (between checkpoints)	Low	Analytics, batch processing
In-Memory Grid	Configurable	High	Medium	Caching, session stores

Choosing Your Strategy

The right replication strategy depends on your consistency requirements and tolerance for data loss. Financial systems typically require synchronous replication, because an acknowledged write must not be lost. Gaming systems might accept occasional state loss in exchange for lower latency. Know your domain's requirements.

Real-World Stateful Architectures

Let's examine how leading technology companies implement stateful architectures at scale.

Discord: WebSocket Servers at Scale

Discord handles millions of concurrent WebSocket connections—a fundamentally stateful workload:

Gateway servers hold WebSocket connections and per-connection state
Session state includes subscriptions (which guilds/channels to receive events for)
Pub/sub backbone distributes events to the right gateway servers
Stateful but horizontally scaled — Millions of connections across thousands of servers
Graceful reconnection — Clients automatically reconnect to potentially different servers

Stateful Architectures at Major Companies
Company	Stateful Component	Scale	State Management
Discord	WebSocket gateways	Millions of connections	Pub/sub for cross-server events, session resumption
Slack	Real-time messaging servers	Millions of connected clients	Channel-based sharding, presence state
Epic Games (Fortnite)	Game servers	100 players/server	In-memory game state, eventual storage
Twitch	Chat servers	Hundreds of thousands/channel	Connection state, IRC-style fanout
Microsoft Orleans	Actor-based services	Millions of actors	Virtual actors, automatic persistence

Epic Games: Fortnite Game Servers

Fortnite game servers are intensely stateful:

100 players per match — All player positions, inventories, health, building states in memory
Fixed-tick simulation: Physics, hit detection, and game logic are recomputed on every server tick, tens of times per second
Authoritative server — Server state is ground truth; clients are thin
Match lifecycle — Server spins up for a match, holds all state, terminates after match ends
No external state during gameplay — Too slow; all state must be in-process

Microsoft Orleans: Virtual Actors

Orleans (used by Halo, Azure services) provides a framework for stateful services:

orleans-actor-example.cs
C#
// Orleans Virtual Actor (Grain) - Managed Statefulness
public interface IPlayerGrain : IGrainWithStringKey
{
    Task<PlayerState> GetState();
    Task UpdatePosition(Vector3 position);
    Task AddItem(Item item);
}
 
public class PlayerGrain : Grain<PlayerState>, IPlayerGrain
{
    // STATE: Held in memory by this grain activation.
    // Grain<TState> exposes it as the State property; Orleans loads it from
    // the configured storage provider when the grain is activated.

    public Task<PlayerState> GetState() => Task.FromResult(State);

    public Task UpdatePosition(Vector3 position)
    {
        // Update in-memory state only
        State.Position = position;
        State.LastUpdate = DateTime.UtcNow;

        // Orleans does not write grain state on its own schedule: this change
        // stays in memory until the grain calls WriteStateAsync(), so it
        // would be lost if the server crashed first.
        return Task.CompletedTask;
    }

    public async Task AddItem(Item item)
    {
        State.Inventory.Add(item);
        // Persist explicitly so the inventory survives server restarts
        await WriteStateAsync();
    }
}
 
// Usage - Orleans routes to correct server transparently
var player = grainFactory.GetGrain<IPlayerGrain>("player-12345");
await player.UpdatePosition(newPosition);  // Routed to stateful grain

Managed Statefulness

Frameworks like Orleans, Akka, and Dapr provide 'managed statefulness'—you write stateful logic, and the framework handles placement, replication, persistence, and failover. This dramatically reduces the operational burden of stateful services.

Operational Challenges of Statefulness

Stateful services demand significantly more operational investment than stateless services. Understanding these challenges helps you prepare appropriately.

Challenge 1: Deployment Complexity

You can't simply roll out new instances and kill old ones. You must handle in-flight state:

Deployment Considerations

•Graceful draining — Stop accepting new connections while existing ones complete
•State migration — Move state from old instances to new ones before termination
•Connection handoff — WebSocket clients must reconnect; coordinate the transition
•Rolling updates take longer — Can't just swap instances; must wait for drain
•Rollback complexity — State may have mutated; can't simply revert code
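
Graceful draining, the first item in the list above, comes down to two rules: refuse new connections once draining starts, and only terminate when the last existing connection has closed. The sketch below shows just that bookkeeping. `DrainableServer` and its method names are hypothetical, and a real server would also need a drain timeout and a way to signal the load balancer.

```typescript
class DrainableServer {
  private activeConnections = new Set<string>();
  private draining = false;

  // Returns false while draining so the caller can route the client elsewhere.
  accept(connectionId: string): boolean {
    if (this.draining) return false;
    this.activeConnections.add(connectionId);
    return true;
  }

  close(connectionId: string): void {
    this.activeConnections.delete(connectionId);
  }

  // Called at the start of a deployment or scale-down for this instance.
  startDrain(): void {
    this.draining = true;
  }

  // Safe to terminate only when draining has begun and nothing is connected.
  canTerminate(): boolean {
    return this.draining && this.activeConnections.size === 0;
  }
}

const oldInstance = new DrainableServer();
oldInstance.accept('conn-1');
oldInstance.accept('conn-2');

oldInstance.startDrain();
const acceptedDuringDrain = oldInstance.accept('conn-3');
const terminableWhileBusy = oldInstance.canTerminate();

oldInstance.close('conn-1');
oldInstance.close('conn-2');
const terminableAfterDrain = oldInstance.canTerminate();
```

This is also why rolling updates take longer for stateful services: the deployment cannot proceed past an instance until `canTerminate()` becomes true or a timeout forces the remaining clients to reconnect.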

Challenge 2: Scaling Constraints

Horizontal scaling is possible but complex:

Scaling Stateful vs Stateless Services
Operation	Stateless	Stateful
Add instance	Immediate, no preparation	Must configure affinity, may need state preload
Remove instance	Terminate immediately	Drain connections, migrate state, then terminate
Auto-scale trigger	CPU/request count	Complex: connections, memory, state size
Scale-down risk	None	State loss if not handled, user disruption
Load balancing	Any algorithm works	Must use affinity-aware algorithms

Challenge 3: Observability and Debugging

Debugging stateful systems is harder because behavior depends on accumulated state:

Non-reproducible issues — "It only happens after 100 requests in a specific sequence"
State corruption — Bugs may silently corrupt state, manifesting later
Memory growth — State accumulation can cause memory leaks
Per-instance metrics diverge — Each server has different state, different behavior
Distributed debugging — Following a user's journey across reconnections requires correlation

The Operational Burden

Stateful services require more sophisticated operations: custom deployment strategies, state migration tooling, connection draining logic, and state-aware monitoring. Only choose statefulness when the benefits outweigh this significant operational investment.

Hybrid Approaches: Best of Both Worlds

Most production systems use hybrid approaches—stateless services for general request handling with stateful components for specific needs.

Pattern 1: Stateless API Tier + Stateful Connection Tier

hybrid-architecture.ts
TypeScript
// Hybrid Architecture Example
 
// STATELESS: API servers handle business logic
// No local state, horizontally scalable
class OrderApiServer {
  async createOrder(request: CreateOrderRequest): Promise<Order> {
    // All state in databases/caches
    const order = await database.orders.create(request);
    
    // Notify stateful tier about new order
    await pubsub.publish('order-created', order);
    
    return order;
  }
}
 
// STATEFUL: WebSocket servers handle real-time connections
// Holds connection state, requires affinity
class RealtimeServer {
  private connections = new Map<string, WebSocket>();
  
  constructor() {
    // Subscribe to events from stateless tier
    pubsub.subscribe('order-created', (order) => {
      // Find connected users who should see this order
      const userId = order.userId;
      const ws = this.connections.get(userId);
      if (ws) {
        ws.send(JSON.stringify({ type: 'order-created', order }));
      }
    });
  }
  
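  handleConnection(ws: WebSocket, userId: string) {
    this.connections.set(userId, ws);
    ws.on('close', () => {
      // Only remove the entry if it still points at this socket; the user
      // may already have reconnected with a newer one
      if (this.connections.get(userId) === ws) {
        this.connections.delete(userId);
      }
    });
  }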
}
 
// ARCHITECTURE:
// [Clients] --> [Load Balancer] --> [Stateless API Servers] --> [Database]
//                    |                                              |
//                    v                                         [Pub/Sub]
//           [Stateful WS Servers] <---------------------------------+

Pattern 2: Stateless with Distributed Cache

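Services stay stateless because the external store remains the source of truth, but each instance keeps a local cache of frequently accessed data. The caches are eventually consistent: an instance may briefly serve a stale value until an invalidation reaches it.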

Benefits of This Hybrid

•API servers remain stateless — Easy scaling, deployment, failure recovery
•Local caching improves performance — Frequently accessed data in memory
•Cache invalidation via pub/sub — State changes propagate to all caches
•Graceful degradation — Cache miss falls back to external store
•Lower external store load — Cache absorbs read traffic
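
The cache-plus-invalidation pattern described above can be sketched as follows. `InvalidationBus` and `ExternalStore` are hypothetical in-process stand-ins for a pub/sub system and a shared database. Delivery here is synchronous, which a real pub/sub system would not guarantee.

```typescript
type InvalidationListener = (key: string) => void;

// Hypothetical stand-in for a pub/sub channel carrying invalidation messages.
class InvalidationBus {
  private listeners: InvalidationListener[] = [];
  subscribe(listener: InvalidationListener): void {
    this.listeners.push(listener);
  }
  publish(key: string): void {
    this.listeners.forEach(listener => listener(key));
  }
}

// Hypothetical external store (the source of truth); counts reads for the demo.
class ExternalStore {
  private data = new Map<string, string>();
  reads = 0;
  get(key: string): string | undefined {
    this.reads += 1;
    return this.data.get(key);
  }
  set(key: string, value: string): void {
    this.data.set(key, value);
  }
}

class CachingInstance {
  private cache = new Map<string, string>();
  private store: ExternalStore;
  private bus: InvalidationBus;

  constructor(store: ExternalStore, bus: InvalidationBus) {
    this.store = store;
    this.bus = bus;
    // Drop the local copy whenever any instance announces a change
    bus.subscribe(key => {
      this.cache.delete(key);
    });
  }

  get(key: string): string | undefined {
    const cached = this.cache.get(key);
    if (cached !== undefined) return cached;
    // Cache miss: fall back to the external store
    const value = this.store.get(key);
    if (value !== undefined) this.cache.set(key, value);
    return value;
  }

  update(key: string, value: string): void {
    this.store.set(key, value);
    this.bus.publish(key);
  }
}

const externalStore = new ExternalStore();
const invalidationBus = new InvalidationBus();
const instanceA = new CachingInstance(externalStore, invalidationBus);
const instanceB = new CachingInstance(externalStore, invalidationBus);

instanceB.update('theme:alice', 'light');
const firstRead = instanceA.get('theme:alice');  // loaded from the store
const secondRead = instanceA.get('theme:alice'); // served from A's cache
const storeReadsSoFar = externalStore.reads;

instanceB.update('theme:alice', 'dark');         // invalidates A's cached copy
const thirdRead = instanceA.get('theme:alice');  // reloaded from the store
```

Losing an instance here costs nothing but a cold cache, because the cache is a disposable copy rather than the only home of the data. That is the property that keeps this tier stateless in the sense used on this page.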

The Pragmatic Path

Most successful systems are hybrids. Stateless for the bulk of processing, stateful only where truly required. This minimizes operational complexity while still supporting real-time features, in-memory computation, or other stateful needs.

Summary: The Case for Statefulness

We've explored stateful architecture comprehensively—when it's needed, how it works, and the operational considerations it demands. Let's consolidate the key insights:

Key Takeaways

•Statefulness means servers hold client-specific state — Session affinity is required, and state loss is possible on failure.
•Legitimate use cases exist — WebSockets, game servers, real-time analytics, workflow orchestration, ML inference.
•Types of state vary — Connection state, session state, computation state, actor state—each with different characteristics.
•Session affinity is the defining operational characteristic — Clients must return to the same server holding their state.
•Replication strategies address reliability — Active-passive, CRDTs, checkpointing—choose based on consistency needs.
•Operational burden is significant — Deployments, scaling, debugging are all more complex than stateless equivalents.
•Hybrid approaches are common and pragmatic — Stateless for most processing, stateful only where truly required.

What's next:

Now that we understand both stateless and stateful architectures, we'll explore the implications for scaling. How do these architectural choices affect your ability to handle growth? What constraints emerge as you scale each type of service? The next page dives deep into the scaling calculus of stateless vs stateful systems.

Page Complete

You now understand stateful architecture thoroughly—its definition, legitimate use cases, state types, session affinity requirements, replication strategies, real-world implementations, and operational challenges. Combined with the previous page on statelessness, you can now reason clearly about where state should live in your systems.

2 / 5

Loading learning content...

System Design (HLD)Stateless vs Stateful Services

Stateless vs Stateful Services

LevelIntermediate

Duration75 mins

TopicStateless vs Stateful Services

2 / 5

Stateful — Session State Maintained

When State Must Live on the Server

What You Will Learn

Defining Statefulness with Precision

Let's formalize this definition:

Stateful Service Definition: A service is stateful if it maintains client-specific or session-specific data in local memory or storage, and this data influences the processing of subsequent requests from that client.

This definition has critical implications:

The Three Characteristics of Statefulness

•Session Affinity Required — Clients must return to the same server instance to access their state. This creates 'sticky' relationships between clients and specific servers.
•Local State Storage — The server holds state in its own memory or local storage, not (only) in external systems. This state is not automatically available to other instances.
•Sequential Request Dependencies — The processing of request N may depend on what happened in requests 1 through N-1. Order and history matter.

Critical Distinction

The State Location Spectrum:

Understanding statefulness requires clarity about where state lives:

Client-side state — Stored in browser/app, sent with each request (cookies, tokens). Server is stateless.
External shared state — Stored in databases/Redis accessible by all instances. Server is stateless.
Server-local state — Stored in the specific server instance's memory/storage. This makes the service stateful.
Hybrid state — Some state local, some external. Partially stateful.

True statefulness involves option 3 or 4—state that is bound to a specific server instance.

Why Statefulness Exists: Legitimate Use Cases

If statelessness is so advantageous, why would anyone choose statefulness? Because certain requirements are fundamentally better served—or only achievable—with stateful architectures.

Use Case 1: Real-Time Bidirectional Communication

WebSocket connections, server-sent events, and gRPC streaming all maintain long-lived connections with client-specific state:

websocket-stateful.ts
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
// WebSocket connections are inherently stateful
class ChatServer {
  // State: Map of active connections per user
  // This state lives in THIS server instance's memory
  private connections = new Map<string, WebSocket[]>();
  
  // State: Per-connection metadata
  private connectionMeta = new WeakMap<WebSocket, ConnectionMetadata>();
  
  handleConnection(ws: WebSocket, userId: string) {
    // Store connection state locally
    if (!this.connections.has(userId)) {
      this.connections.set(userId, []);
    }
    this.connections.get(userId)!.push(ws);
    
    // Store metadata about this specific connection
    this.connectionMeta.set(ws, {
      userId,
      connectedAt: Date.now(),
      lastHeartbeat: Date.now(),
      messageCount: 0  // Accumulating state across messages
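    });

    // Remove the socket when it closes; otherwise this map grows without bound
    // (assumes the Node.js `ws` package, whose sockets expose `on`)
    ws.on('close', () => {
      const remaining = (this.connections.get(userId) ?? []).filter(c => c !== ws);
      if (remaining.length > 0) {
        this.connections.set(userId, remaining);
      } else {
        this.connections.delete(userId);
      }
    });
  }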
  
  sendToUser(userId: string, message: Message) {
    // Can only send to users connected to THIS server
    // This is the fundamental limitation of stateful services
    const userConnections = this.connections.get(userId);
    
    if (userConnections) {
      userConnections.forEach(ws => ws.send(JSON.stringify(message)));
    }
    // If user is connected to a DIFFERENT server, we can't reach them
    // This requires additional coordination (pub/sub, etc.)
  }
}

Use Case 2: In-Memory Computation and Caching

Some systems require holding large datasets or computation state in memory for performance:

Memory-Intensive Stateful Systems

•Game servers — Hold entire game world state, player positions, physics simulation in memory. Rebuilding from database each frame is impossible.
•Real-time analytics — Maintain sliding windows, aggregations, and streaming computations. State is the computation itself.
•Machine learning inference — Large models loaded into memory. Model state includes cached computations, batch buffers.
•Collaborative editing — Document state, cursor positions, conflict resolution all maintained in server memory for sub-100ms latency.
•Financial trading systems — Order books, position tracking, risk calculations updated on every tick. State is the system.

Use Case 3: Workflow and Saga Orchestration

Long-running workflows often maintain state about multi-step processes:

workflow-orchestrator.ts
TypeScript
// Workflow orchestrator maintaining execution state
class OrderFulfillmentOrchestrator {
  // State: Active workflow executions
  private activeWorkflows = new Map<string, WorkflowState>();
  
  async executeWorkflow(orderId: string) {
    // Create and track workflow state
    const state: WorkflowState = {
      orderId,
      step: 'payment',
      startedAt: Date.now(),
      compensations: [],  // For rollback if needed
      retryCount: 0
    };
    this.activeWorkflows.set(orderId, state);
    
    try {
      // Step 1: Payment
      state.step = 'payment';
      const paymentResult = await this.processPayment(orderId);
      state.compensations.push(() => this.refundPayment(paymentResult.id));
      
      // Step 2: Inventory
      state.step = 'inventory';
      const reservationResult = await this.reserveInventory(orderId);
      state.compensations.push(() => this.releaseInventory(reservationResult.id));
      
      // Step 3: Shipping
      state.step = 'shipping';
      await this.initiateShipping(orderId);
      
      // Success
      state.step = 'completed';
      
    } catch (error) {
      // Saga compensation - rollback in reverse order
      // Relies on state accumulated during execution
      // (copy before reversing so the recorded order is not mutated)
      for (const compensation of [...state.compensations].reverse()) {
        await compensation();
      }
      state.step = 'rolled_back';
    }
  }
  
  getWorkflowStatus(orderId: string): WorkflowState | undefined {
    // Can only check workflows running on THIS server
    return this.activeWorkflows.get(orderId);
  }
}

The Common Thread

Types of Server State

Not all server state is equal. Understanding the different types of state helps you make better decisions about what state to keep locally and how to manage it.

1. Connection State

The most common form of server state in modern systems. Each active connection has associated state:

Connection State Examples
Protocol	Connection State Includes	Lifetime
WebSocket	Socket handle, authentication context, subscriptions, pending messages	Session duration (minutes to hours)
HTTP/2	Stream multiplexing state, flow control windows, header compression context	Connection duration (seconds to minutes)
gRPC Streaming	Stream state, buffered messages, cancellation tokens	Stream duration (variable)
Database Connection	Transaction state, prepared statements, session variables	Request or connection pool lifetime

2. Session State

Information about an ongoing user session that spans multiple requests:

Session State Examples

•Authentication tokens and refresh state — Token validation caches, session expiry tracking
•User preferences loaded into memory — Language, theme, feature flags for fast access
•Shopping cart in e-commerce — Items, quantities, applied discounts (if not externalized)
•Form wizard progress — Multi-step form data accumulated across requests
•Rate limiting counters — Request counts per client for throttling (when not Redis-backed)
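
The rate limiting bullet shows how locally held session state changes system behavior. The following is a minimal sketch, assuming a fixed-window limiter whose counters live in instance memory. The class name `LocalRateLimiter`, the limit of 3, and the round-robin routing are hypothetical choices for illustration.

```typescript
// Fixed-window rate limiter whose counters live in this instance's memory.
// Hypothetical sketch: names and limits are illustrative only.
class LocalRateLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();

  constructor(private limit: number, private windowMs: number) {}

  // Returns true if the request is allowed at time `now` (milliseconds).
  allow(clientId: string, now: number): boolean {
    let entry = this.counts.get(clientId);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      entry = { windowStart: now, count: 0 }; // Start a fresh window
      this.counts.set(clientId, entry);
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

// Two instances behind a load balancer WITHOUT session affinity:
const instanceA = new LocalRateLimiter(3, 60_000);
const instanceB = new LocalRateLimiter(3, 60_000);

let allowedTotal = 0;
for (let i = 0; i < 10; i++) {
  const target = i % 2 === 0 ? instanceA : instanceB; // Round-robin routing
  if (target.allow("client-1", 1_000 + i)) allowedTotal++;
}
// Each instance admits 3 requests, so the client gets 6 instead of 3.
```

With affinity, every request from `client-1` would reach the same instance and the limit would hold. Without it, the effective limit scales with the number of instances, which is why such counters are often moved to a shared store.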

3. Computation State

State that represents ongoing or cached computations:

computation-state.ts
TypeScript
// Examples of computation state in stateful services
 
class StreamingAggregator {
  // State: Sliding window for real-time metrics
  private windows = new Map<string, SlidingWindow>();
  
  // This state is the computation - it cannot be externalized
  // without destroying the real-time property
  recordEvent(metricId: string, value: number, timestamp: number) {
    const window = this.windows.get(metricId) ?? new SlidingWindow(60_000);
    window.add(value, timestamp);
    this.windows.set(metricId, window);
  }
  
  getAverage(metricId: string): number {
    return this.windows.get(metricId)?.average() ?? 0;
  }
}
 
class GamePhysicsEngine {
  // State: Entity positions, velocities, collision state
  private entities = new Map<string, PhysicsEntity>();
  
  // State: Spatial partitioning structure for collision detection
  private spatialHash = new SpatialHashGrid(100);
  
  // This state represents the simulation - must be in memory
  tick(deltaTime: number) {
    for (const entity of this.entities.values()) {
      // Update positions
      entity.position.x += entity.velocity.x * deltaTime;
      entity.position.y += entity.velocity.y * deltaTime;
      
      // Update spatial hash
      this.spatialHash.update(entity.id, entity.position);
    }
    
    // Collision detection using spatial hash
    this.resolveCollisions();
  }
}
 
class MLModelServer {
  // State: Large neural network weights loaded in memory
  // (null until initialize() completes)
  private model: NeuralNetwork | null = null;
  
  // State: Batch buffer for efficient inference
  private batchBuffer: InferenceRequest[] = [];
  private batchTimer: NodeJS.Timeout | null = null;
  
  // Loading model from disk takes seconds - must stay in memory
  async initialize() {
    this.model = await loadModel('models/recommendation-v3.onnx');
    // 500MB+ model now in server memory
  }
  
  async infer(input: Tensor): Promise<Tensor> {
    // Guard against requests that arrive before the model is loaded
    if (!this.model) {
      throw new Error('Model not loaded: call initialize() first');
    }
    return this.model.forward(input);
  }
}

4. Actor/Entity State

In actor-model systems (Akka, Orleans, Dapr), actors encapsulate state:
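
As a rough illustration of the idea, here is a minimal sketch of an actor: private state plus a mailbox that is processed one message at a time. It does not use the API of Akka, Orleans, or Dapr. The names `CounterActor`, `CounterMessage`, and `getActor` are hypothetical, and real frameworks add placement, supervision, and persistence on top.

```typescript
// Minimal actor sketch: private state plus a mailbox processed one message at a time.
// Hypothetical illustration, not the API of any real actor framework.
type CounterMessage =
  | { kind: "increment"; by: number }
  | { kind: "reset" };

class CounterActor {
  private total = 0; // State owned exclusively by this actor
  private mailbox: CounterMessage[] = [];
  private processing = false;

  // Callers never touch `total` directly; they enqueue messages.
  send(message: CounterMessage): void {
    this.mailbox.push(message);
    this.drain();
  }

  // Processes queued messages strictly in arrival order.
  private drain(): void {
    if (this.processing) return; // Re-entrant sends only enqueue
    this.processing = true;
    while (this.mailbox.length > 0) {
      const message = this.mailbox.shift()!;
      this.handle(message);
    }
    this.processing = false;
  }

  private handle(message: CounterMessage): void {
    switch (message.kind) {
      case "increment":
        this.total += message.by;
        break;
      case "reset":
        this.total = 0;
        break;
    }
  }

  snapshot(): number {
    return this.total;
  }
}

// A registry maps each actor ID to the single in-memory instance that owns its state.
const actorRegistry = new Map<string, CounterActor>();

function getActor(id: string): CounterActor {
  let actor = actorRegistry.get(id);
  if (!actor) {
    actor = new CounterActor();
    actorRegistry.set(id, actor);
  }
  return actor;
}

getActor("cart-42").send({ kind: "increment", by: 2 });
getActor("cart-42").send({ kind: "increment", by: 3 });
```

Because every message for `cart-42` must reach the one instance that holds its state, the routing problem is the same as for any other stateful service. Actor frameworks handle that routing on the caller's behalf.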

Actor State Locality

Session Affinity: Routing to State

The defining operational characteristic of stateful services is session affinity (also called "sticky sessions"). Clients must be consistently routed to the server instance holding their state.

How Session Affinity Works:

Load balancers implement affinity through various mechanisms:

Session Affinity Mechanisms
Mechanism	How It Works	Pros	Cons
Cookie-based	LB sets/reads cookie with server ID	Works through NAT, browser handles persistence	Cookie size limits, requires cookie support
IP Hash	Hash of client IP determines server	No cookies needed, works for any protocol	Breaks with NAT/proxy, mobile IP changes
Header-based	Custom header indicates target server	Flexible, controllable by application	Requires application cooperation
Connection-based	Same TCP connection stays on same server	Perfect for WebSocket/streams	New connections may hit different server
Consistent Hashing	Hash of session ID to server ring	Handles server changes gracefully	More complex implementation

nginx-session-affinity.conf
Nginx Config
# NGINX configuration for session affinity
 
# Method 1: IP Hash - Simple but imperfect
upstream backend_ip_hash {
    ip_hash;
    server backend1.example.com:8080;
    server backend2.example.com:8080;
    server backend3.example.com:8080;
}
 
# Method 2: Cookie-based sticky sessions (NGINX Plus)
upstream backend_sticky {
    # Creates/reads cookie named 'srv_id' with 1-hour expiry
    sticky cookie srv_id expires=1h domain=.example.com path=/;
    server backend1.example.com:8080;
    server backend2.example.com:8080;
    server backend3.example.com:8080;
}
 
# Method 3: Consistent hashing on a header/variable
upstream backend_consistent {
    hash $cookie_session_id consistent;
    server backend1.example.com:8080;
    server backend2.example.com:8080;
    server backend3.example.com:8080;
}
 
server {
    listen 80;
    
    location /api/ {
        proxy_pass http://backend_sticky;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        
        # WebSocket support for stateful connections
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
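
To show why the consistent hashing row in the table is described as handling server changes gracefully, here is a minimal sketch of a hash ring. It is an illustration under stated assumptions, not how NGINX implements `hash ... consistent`: the FNV-1a hash, the 100 virtual nodes per server, and the names `HashRing` and `fnv1a` are all choices made for this example.

```typescript
// Consistent hashing sketch: a session maps to the first server point clockwise on a ring.
// Hypothetical illustration; the hash function and virtual node count are arbitrary choices.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

class HashRing {
  private ring: { point: number; server: string }[] = [];

  constructor(servers: string[], private virtualNodes = 100) {
    for (const server of servers) this.addServer(server);
  }

  addServer(server: string): void {
    for (let v = 0; v < this.virtualNodes; v++) {
      this.ring.push({ point: fnv1a(`${server}#${v}`), server });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  removeServer(server: string): void {
    this.ring = this.ring.filter((entry) => entry.server !== server);
  }

  // Binary search for the first ring point >= hash(sessionId), wrapping to index 0.
  lookup(sessionId: string): string {
    if (this.ring.length === 0) throw new Error("no servers in ring");
    const target = fnv1a(sessionId);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].point < target) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].server;
  }
}

const ring = new HashRing(["backend1", "backend2", "backend3"]);
const sessions = Array.from({ length: 1000 }, (_, i) => `session-${i}`);
const before = sessions.map((s) => ring.lookup(s));

ring.removeServer("backend3");
const after = sessions.map((s) => ring.lookup(s));

// Sessions that were NOT on backend3 keep their server after the removal.
const movedFromSurvivors = sessions.filter(
  (_, i) => before[i] !== "backend3" && before[i] !== after[i]
).length;
```

Here `movedFromSurvivors` is 0: only sessions that lived on the removed server are remapped. A plain modulo hash would instead reshuffle most sessions whenever the server count changes.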

The Session Affinity Problem: Server Failure

The critical weakness of session affinity is server failure handling. When a server holding state dies:

1. State is lost — Unless replicated or persisted, all in-memory state disappears
2. Clients must re-establish — Sessions, connections, computations must restart
3. Load imbalance during recovery — Remaining servers absorb displaced load
4. User experience degradation — Users may see errors, lost progress, or need to re-authenticate

Without Proper Handling

•Server-3 crashes
•All sessions on Server-3 lost immediately
•Users see 500 errors or connection drops
•Load balancer eventually marks server down
•Users retry, hit different server, no state
•Must re-authenticate, restart workflows

With Proper Handling

•Server-3 crashes
•State was replicated to Server-4 (hot replica)
•Health check detects failure in <5s
•Load balancer updates affinity to Server-4
•Clients auto-reconnect to Server-4
•Continued operation with minimal disruption
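
The "with proper handling" sequence can be sketched as heartbeat-based failure detection plus rerouting to a hot replica. This is a simplified illustration, not a production design. The `FailoverRouter` class, the primary-to-replica map, and the 5-second timeout are assumptions, and timestamps are passed in explicitly so the behavior is deterministic.

```typescript
// Heartbeat-based failure detection with failover to a replica.
// Hypothetical sketch: server names, the 5s timeout, and the replica map are assumptions.
class FailoverRouter {
  private lastHeartbeat = new Map<string, number>();

  constructor(
    private replicaOf: Map<string, string>, // primary -> hot replica
    private timeoutMs = 5_000
  ) {}

  heartbeat(server: string, now: number): void {
    this.lastHeartbeat.set(server, now);
  }

  isAlive(server: string, now: number): boolean {
    const seen = this.lastHeartbeat.get(server);
    return seen !== undefined && now - seen <= this.timeoutMs;
  }

  // Returns the server that should receive traffic pinned to `primary`.
  route(primary: string, now: number): string {
    if (this.isAlive(primary, now)) return primary;
    const replica = this.replicaOf.get(primary);
    if (replica && this.isAlive(replica, now)) return replica;
    throw new Error(`no live server holds state for ${primary}`);
  }
}

const router = new FailoverRouter(new Map([["server-3", "server-4"]]));
router.heartbeat("server-3", 0);
router.heartbeat("server-4", 0);

const beforeCrash = router.route("server-3", 1_000); // "server-3"

// server-3 stops sending heartbeats; server-4 keeps going.
router.heartbeat("server-4", 6_000);
const afterCrash = router.route("server-3", 7_000); // "server-4"
```

Rerouting only helps if the replica already holds the state, which is the subject of the replication strategies that follow.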

The Reliability Tax

State Replication Strategies

To achieve reliability in stateful systems, state must be replicated or persisted. Several strategies exist, each with different trade-offs.

Strategy 1: Active-Passive Replication

One server (primary) holds active state; others (secondaries) receive replicated copies:

active-passive-replication.ts
TypeScript
// Active-Passive replication pattern
class ReplicatedSessionStore {
  private localState = new Map<string, SessionData>();

  // Role and peers are supplied at startup so both fields are always initialized
  constructor(
    private role: 'primary' | 'secondary',
    private peerNodes: PeerConnection[]
  ) {}
  
  async updateSession(sessionId: string, data: SessionData) {
    if (this.role === 'secondary') {
      throw new Error('Cannot update on secondary node');
    }
    
    // Update local state
    this.localState.set(sessionId, data);
    
    // Replicate to secondaries (sync or async)
    const replicationPromises = this.peerNodes.map(
      peer => peer.replicate(sessionId, data)
    );
    
    // Choice: Wait for replication (consistent but slower)
    // or continue immediately (faster but risk data loss)
    await Promise.all(replicationPromises);  // Sync replication
    
    return data;
  }
  
  async promoteToLeader() {
    // Called when primary fails and this node takes over
    this.role = 'primary';
    // State is already local from replication
  }
}

Strategy 2: Multi-Primary / Conflict Resolution

All nodes can accept writes; conflicts are resolved via CRDTs or last-write-wins:

multi-primary-crdt.ts
TypeScript
// Multi-primary with CRDT for conflict resolution
class DistributedCounter {
  // CRDT: G-Counter (grow-only counter)
  // Each node tracks its own increments
  private nodeId: string;
  private counters = new Map<string, number>();
  
  constructor(nodeId: string) {
    this.nodeId = nodeId;
    this.counters.set(nodeId, 0);
  }
  
  increment() {
    // Only increment this node's counter
    const current = this.counters.get(this.nodeId) ?? 0;
    this.counters.set(this.nodeId, current + 1);
  }
  
  // Merge function - CRDT magic
  // No conflict possible: always take max
  merge(other: Map<string, number>) {
    for (const [nodeId, count] of other) {
      const myCount = this.counters.get(nodeId) ?? 0;
      this.counters.set(nodeId, Math.max(myCount, count));
    }
  }
  
  // Read total across all nodes
  value(): number {
    let total = 0;
    for (const count of this.counters.values()) {
      total += count;
    }
    return total;
  }
}
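
For comparison with the G-Counter, here is a minimal sketch of the other resolution approach named above, last-write-wins. The `LwwRegister` class and `LwwEntry` shape are hypothetical, and timestamps are passed in explicitly rather than read from a clock so the example stays deterministic.

```typescript
// Last-write-wins register: keeps the value with the highest (timestamp, nodeId) pair.
// Hypothetical sketch; a real system must also deal with clock skew between nodes.
interface LwwEntry<T> {
  value: T;
  timestamp: number;
  nodeId: string; // Tie-breaker when timestamps are equal
}

class LwwRegister<T> {
  private entry: LwwEntry<T> | null = null;

  constructor(private nodeId: string) {}

  set(value: T, timestamp: number): void {
    this.merge({ value, timestamp, nodeId: this.nodeId });
  }

  // Merge is commutative, associative, and idempotent, so replicas converge.
  merge(incoming: LwwEntry<T> | null): void {
    if (!incoming) return;
    const current = this.entry;
    const wins =
      current === null ||
      incoming.timestamp > current.timestamp ||
      (incoming.timestamp === current.timestamp && incoming.nodeId > current.nodeId);
    if (wins) this.entry = incoming;
  }

  get(): T | undefined {
    return this.entry?.value;
  }

  state(): LwwEntry<T> | null {
    return this.entry;
  }
}

const replicaA = new LwwRegister<string>("node-a");
const replicaB = new LwwRegister<string>("node-b");

replicaA.set("dark", 100); // Concurrent writes on different nodes
replicaB.set("light", 105);

replicaA.merge(replicaB.state());
replicaB.merge(replicaA.state());
// Both replicas now hold "light"; the write at t=100 is silently discarded.
```

The trade-off is visible in the last comment: last-write-wins always converges, but it does so by dropping one of the concurrent writes. The G-Counter above preserves every increment.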

Strategy 3: Periodic Checkpointing

State is periodically written to durable storage for recovery:
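
A minimal sketch of the idea follows. It checkpoints every N events rather than on a timer so that the behavior is deterministic. `DurableStore`, `InMemoryStore`, and `CheckpointedCounter` are hypothetical names, and the in-memory store stands in for a database or object store.

```typescript
// Periodic checkpointing sketch: in-memory state is snapshotted to durable storage
// every N events and restored after a restart.
// Hypothetical illustration: `DurableStore` stands in for a database or object store.
interface DurableStore {
  save(key: string, snapshot: string): void;
  load(key: string): string | undefined;
}

class InMemoryStore implements DurableStore {
  private data = new Map<string, string>();
  save(key: string, snapshot: string): void {
    this.data.set(key, snapshot);
  }
  load(key: string): string | undefined {
    return this.data.get(key);
  }
}

class CheckpointedCounter {
  private counts = new Map<string, number>();
  private eventsSinceCheckpoint = 0;

  constructor(
    private store: DurableStore,
    private checkpointKey: string,
    private checkpointEvery: number
  ) {
    // Recovery: start from the last checkpoint if one exists.
    const saved = store.load(checkpointKey);
    if (saved) this.counts = new Map(JSON.parse(saved) as [string, number][]);
  }

  record(metricId: string): void {
    this.counts.set(metricId, (this.counts.get(metricId) ?? 0) + 1);
    this.eventsSinceCheckpoint += 1;
    if (this.eventsSinceCheckpoint >= this.checkpointEvery) this.checkpoint();
  }

  checkpoint(): void {
    const snapshot = JSON.stringify(Array.from(this.counts.entries()));
    this.store.save(this.checkpointKey, snapshot);
    this.eventsSinceCheckpoint = 0;
  }

  get(metricId: string): number {
    return this.counts.get(metricId) ?? 0;
  }
}

const durable = new InMemoryStore();
const original = new CheckpointedCounter(durable, "counter-1", 5);
for (let i = 0; i < 12; i++) original.record("page_views");

// Simulated crash: a new instance recovers from the last checkpoint (taken at event 10).
const recovered = new CheckpointedCounter(durable, "counter-1", 5);
```

The original instance counted 12 events, but the recovered one starts at 10. The two events after the last checkpoint are lost, which is the point-in-time consistency this strategy offers.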

State Replication Strategy Comparison
Strategy	Consistency	Availability	Complexity	Best For
Active-Passive (Sync)	Strong	Lower during failover	Medium	Financial, order processing
Active-Passive (Async)	Eventual (risk of loss)	High	Medium	Gaming, sessions
Multi-Primary CRDT	Eventual (convergent)	High	High	Counters, sets, distributed data
Checkpointing	Point-in-time	High (between checkpoints)	Low	Analytics, batch processing
In-Memory Grid	Configurable	High	Medium	Caching, session stores

Choosing Your Strategy

Real-World Stateful Architectures

Let's examine how leading technology companies implement stateful architectures at scale.

Discord: WebSocket Servers at Scale

Discord handles millions of concurrent WebSocket connections—a fundamentally stateful workload:

1. Gateway servers hold WebSocket connections and per-connection state
2. Session state includes subscriptions (which guilds/channels to receive events for)
3. Pub/sub backbone distributes events to the right gateway servers
4. Stateful but horizontally scaled — Millions of connections across thousands of servers
5. Graceful reconnection — Clients automatically reconnect to potentially different servers

Stateful Architectures at Major Companies
Company	Stateful Component	Scale	State Management
Discord	WebSocket gateways	Millions of connections	Pub/sub for cross-server events, session resumption
Slack	Real-time messaging servers	Millions of connected clients	Channel-based sharding, presence state
Epic Games (Fortnite)	Game servers	100 players/server	In-memory game state, eventual storage
Twitch	Chat servers	Hundreds of thousands/channel	Connection state, IRC-style fanout
Microsoft Orleans	Actor-based services	Millions of actors	Virtual actors, automatic persistence

Epic Games: Fortnite Game Servers

Fortnite game servers are intensely stateful:

1. 100 players per match — All player positions, inventories, health, building states in memory
2. 60 tick simulation — Physics, hit detection, game logic updated 60 times/second
3. Authoritative server — Server state is ground truth; clients are thin
4. Match lifecycle — Server spins up for a match, holds all state, terminates after match ends
5. No external state during gameplay — Too slow; all state must be in-process

Microsoft Orleans: Virtual Actors

Orleans (used by Halo, Azure services) provides a framework for stateful services:

orleans-actor-example.cs
C#
// Orleans Virtual Actor (Grain) - Managed Statefulness
public interface IPlayerGrain : IGrainWithStringKey
{
    Task<PlayerState> GetState();
    Task UpdatePosition(Vector3 position);
    Task AddItem(Item item);
}
 
public class PlayerGrain : Grain<PlayerState>, IPlayerGrain
{
    // STATE: Grain<TState> exposes the grain's state through the State property.
    // Orleans handles activation and placement, and loads State from the
    // configured storage provider when the grain activates.
    
    public Task<PlayerState> GetState() => Task.FromResult(State);
    
    public Task UpdatePosition(Vector3 position)
    {
        // Update in-memory state only
        State.Position = position;
        State.LastUpdate = DateTime.UtcNow;
        
        // Not persisted here: Orleans writes state only when WriteStateAsync()
        // is called, so high-frequency updates stay in memory
        return Task.CompletedTask;
    }
    
    public async Task AddItem(Item item)
    {
        State.Inventory.Add(item);
        // Persist explicitly so the inventory survives server restarts
        await WriteStateAsync();
    }
}
 
// Usage - Orleans routes to correct server transparently
var player = grainFactory.GetGrain<IPlayerGrain>("player-12345");
await player.UpdatePosition(newPosition);  // Routed to stateful grain

Managed Statefulness

Operational Challenges of Statefulness

Stateful services demand significantly more operational investment than stateless services. Understanding these challenges helps you prepare appropriately.

Challenge 1: Deployment Complexity

You can't simply roll out new instances and kill old ones. You must handle in-flight state:

Deployment Considerations

•Graceful draining — Stop accepting new connections while existing ones complete
•State migration — Move state from old instances to new ones before termination
•Connection handoff — WebSocket clients must reconnect; coordinate the transition
•Rolling updates take longer — Can't just swap instances; must wait for drain
•Rollback complexity — State may have mutated; can't simply revert code
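
Graceful draining, the first item above, can be sketched as a small state machine. This is a simplified illustration under stated assumptions. `DrainableServer` and its methods are hypothetical, and a real server would also notify clients and enforce a drain deadline.

```typescript
// Graceful draining sketch: stop accepting new connections, wait for existing
// ones to finish, then report that the instance is safe to terminate.
// Hypothetical illustration; names and methods are assumptions for this example.
class DrainableServer {
  private active = new Set<string>();
  private draining = false;

  // Returns false while draining, so the load balancer routes new clients elsewhere.
  accept(connectionId: string): boolean {
    if (this.draining) return false;
    this.active.add(connectionId);
    return true;
  }

  close(connectionId: string): void {
    this.active.delete(connectionId);
  }

  // Health checks start failing once draining begins.
  isHealthy(): boolean {
    return !this.draining;
  }

  startDrain(): void {
    this.draining = true;
  }

  // Safe to terminate only when draining has started and no connections remain.
  canTerminate(): boolean {
    return this.draining && this.active.size === 0;
  }
}

const instance = new DrainableServer();
instance.accept("conn-1");
instance.accept("conn-2");

instance.startDrain();
const acceptedDuringDrain = instance.accept("conn-3"); // false
const readyTooEarly = instance.canTerminate(); // false: two connections remain

instance.close("conn-1");
instance.close("conn-2");
const readyNow = instance.canTerminate(); // true
```

A deployment tool would poll `canTerminate()` (or an equivalent signal) before killing the process, which is why rolling updates of stateful services take longer than stateless ones.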

Challenge 2: Scaling Constraints

Horizontal scaling is possible but complex:

Scaling Stateful vs Stateless Services
Operation	Stateless	Stateful
Add instance	Immediate, no preparation	Must configure affinity, may need state preload
Remove instance	Terminate immediately	Drain connections, migrate state, then terminate
Auto-scale trigger	CPU/request count	Complex: connections, memory, state size
Scale-down risk	Minimal (only in-flight requests)	State loss if not handled, user disruption
Load balancing	Any algorithm works	Must use affinity-aware algorithms

Challenge 3: Observability and Debugging

Debugging stateful systems is harder because behavior depends on accumulated state:

1. Non-reproducible issues — "It only happens after 100 requests in a specific sequence"
2. State corruption — Bugs may silently corrupt state, manifesting later
3. Memory growth — State accumulation can cause memory leaks
4. Per-instance metrics diverge — Each server has different state, different behavior
5. Distributed debugging — Following a user's journey across reconnections requires correlation

The Operational Burden

Hybrid Approaches: Best of Both Worlds

Most production systems use hybrid approaches—stateless services for general request handling with stateful components for specific needs.

Pattern 1: Stateless API Tier + Stateful Connection Tier

hybrid-architecture.ts
TypeScript
// Hybrid Architecture Example
 
// STATELESS: API servers handle business logic
// No local state, horizontally scalable
class OrderApiServer {
  async createOrder(request: CreateOrderRequest): Promise<Order> {
    // All state in databases/caches
    const order = await database.orders.create(request);
    
    // Notify stateful tier about new order
    await pubsub.publish('order-created', order);
    
    return order;
  }
}
 
// STATEFUL: WebSocket servers handle real-time connections
// Holds connection state, requires affinity
class RealtimeServer {
  private connections = new Map<string, WebSocket>();
  
  constructor() {
    // Subscribe to events from stateless tier
    pubsub.subscribe('order-created', (order) => {
      // Find connected users who should see this order
      const userId = order.userId;
      const ws = this.connections.get(userId);
      if (ws) {
        ws.send(JSON.stringify({ type: 'order-created', order }));
      }
    });
  }
  
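  handleConnection(ws: WebSocket, userId: string) {
    this.connections.set(userId, ws);
    ws.on('close', () => {
      // Only remove the entry if it still points at this socket; a reconnect
      // may already have replaced it with a newer connection
      if (this.connections.get(userId) === ws) {
        this.connections.delete(userId);
      }
    });
  }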
}
 
// ARCHITECTURE:
// [Clients] --> [Load Balancer] --> [Stateless API Servers] --> [Database]
//                    |                                              |
//                    v                                         [Pub/Sub]
//           [Stateful WS Servers] <---------------------------------+

Pattern 2: Stateless with Distributed Cache

Services are stateless, but frequently accessed state is cached locally with eventual consistency:

Benefits of This Hybrid

•API servers remain stateless — Easy scaling, deployment, failure recovery
•Local caching improves performance — Frequently accessed data in memory
•Cache invalidation via pub/sub — State changes propagate to all caches
•Graceful degradation — Cache miss falls back to external store
•Lower external store load — Cache absorbs read traffic
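
The pattern can be sketched as follows. This is a minimal illustration, not a production cache: `InvalidationBus` stands in for a pub/sub system, a plain `Map` stands in for the external store, and the name `CachedReader` is hypothetical.

```typescript
// Local read cache with pub/sub invalidation.
// Hypothetical sketch: `InvalidationBus` stands in for a pub/sub system,
// and `sharedStore` stands in for the external database.
type InvalidationHandler = (key: string) => void;

class InvalidationBus {
  private handlers: InvalidationHandler[] = [];
  subscribe(handler: InvalidationHandler): void {
    this.handlers.push(handler);
  }
  publish(key: string): void {
    this.handlers.forEach((h) => h(key));
  }
}

class CachedReader {
  private local = new Map<string, string>();
  public storeReads = 0;

  constructor(private sharedStore: Map<string, string>, private bus: InvalidationBus) {
    // Drop the local copy whenever any instance announces a change.
    bus.subscribe((key) => this.local.delete(key));
  }

  read(key: string): string | undefined {
    const cached = this.local.get(key);
    if (cached !== undefined) return cached;
    this.storeReads++; // Cache miss falls back to the external store
    const value = this.sharedStore.get(key);
    if (value !== undefined) this.local.set(key, value);
    return value;
  }

  write(key: string, value: string): void {
    this.sharedStore.set(key, value); // The external store stays the source of truth
    this.bus.publish(key); // Invalidate every instance's cached copy
  }
}

const sharedStore = new Map<string, string>([["theme:user-1", "dark"]]);
const bus = new InvalidationBus();
const serverA = new CachedReader(sharedStore, bus);
const serverB = new CachedReader(sharedStore, bus);

serverA.read("theme:user-1"); // Miss: reads the store
serverA.read("theme:user-1"); // Hit: served from local memory
serverB.write("theme:user-1", "light"); // Invalidates serverA's copy
const seenByA = serverA.read("theme:user-1"); // Miss again, returns "light"
```

Losing an instance here costs only a warm cache, not data, because the cached entries can always be rebuilt from the store. That is what keeps the API tier effectively stateless.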

The Pragmatic Path

Summary: The Case for Statefulness

We've explored stateful architecture comprehensively—when it's needed, how it works, and the operational considerations it demands. Let's consolidate the key insights:

Key Takeaways

•Statefulness means servers hold client-specific state — Session affinity is required, and state loss is possible on failure.
•Legitimate use cases exist — WebSockets, game servers, real-time analytics, workflow orchestration, ML inference.
•Types of state vary — Connection state, session state, computation state, actor state—each with different characteristics.
•Session affinity is the defining operational characteristic — Clients must return to the same server holding their state.
•Replication strategies address reliability — Active-passive, CRDTs, checkpointing—choose based on consistency needs.
•Operational burden is significant — Deployments, scaling, debugging are all more complex than stateless equivalents.
•Hybrid approaches are common and pragmatic — Stateless for most processing, stateful only where truly required.

What's next:
