In the previous page, we explored the elegance and power of stateless architecture. Statelessness is often the default recommendation for scalable systems—and rightfully so. But architecture is about trade-offs, not absolutes.
Some systems genuinely require stateful services. Not as a compromise, not as technical debt, but as the correct architectural choice given their requirements. Understanding when and how to embrace statefulness is just as important as understanding statelessness.
This page takes you deep into stateful architecture: its definition, legitimate use cases, implementation patterns, operational implications, and the design decisions that make stateful systems manageable at scale.
By the end of this page, you will understand stateful architecture thoroughly: why it exists, when it's necessary, how to implement it correctly, and how leading companies run stateful systems at scale. You'll be equipped to make informed choices about when state should live on your servers.
Statefulness is an architectural property where servers maintain information about client sessions or interactions across multiple requests. The server "remembers" previous interactions, and this memory affects how subsequent requests are processed.
Let's formalize this definition:
Stateful Service Definition: A service is stateful if it maintains client-specific or session-specific data in local memory or storage, and this data influences the processing of subsequent requests from that client.
This definition has critical implications:
Stateful services are NOT simply services that interact with databases. A service that reads user data from PostgreSQL is not stateful—it's stateless with external storage. Statefulness specifically means the server instance itself maintains state that would be lost if that instance disappeared.
The State Location Spectrum:
Understanding statefulness requires clarity about where state lives:
1. Client-side state: stored in the browser/app and sent with each request (cookies, tokens). Server is stateless.
2. External shared state: stored in databases/Redis accessible by all instances. Server is stateless.
3. Server-local state: stored in the specific server instance's memory/storage. This makes the service stateful.
4. Hybrid state: some state local, some external. Partially stateful.
True statefulness involves option 3 or 4—state that is bound to a specific server instance.
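To make the distinction concrete, here is a minimal sketch contrasting server-local state with external shared state. The class names (`LocalCounterServer`, `SharedCounterServer`, `ExternalStore`) are hypothetical, and `ExternalStore` is an in-process stand-in for something like Redis, not a real client.

```typescript
// Hypothetical stand-in for an external shared store (e.g. Redis or a database).
class ExternalStore {
  private data = new Map<string, number>();
  get(key: string): number {
    return this.data.get(key) ?? 0;
  }
  set(key: string, value: number): void {
    this.data.set(key, value);
  }
}

// Stateful: the visit count lives in this instance's memory.
class LocalCounterServer {
  private visits = new Map<string, number>();
  handleRequest(userId: string): number {
    const next = (this.visits.get(userId) ?? 0) + 1;
    this.visits.set(userId, next);
    return next;
  }
}

// Stateless: the visit count lives in the shared store,
// so any instance can serve any user.
class SharedCounterServer {
  private store: ExternalStore;
  constructor(store: ExternalStore) {
    this.store = store;
  }
  handleRequest(userId: string): number {
    const next = this.store.get(userId) + 1;
    this.store.set(userId, next);
    return next;
  }
}

// Two instances of each kind; the second request lands on a different instance.
const localA = new LocalCounterServer();
const localB = new LocalCounterServer();
localA.handleRequest("alice");
const localSeenByB = localB.handleRequest("alice"); // B has no memory of A's request

const store = new ExternalStore();
const sharedA = new SharedCounterServer(store);
const sharedB = new SharedCounterServer(store);
sharedA.handleRequest("alice");
const sharedSeenByB = sharedB.handleRequest("alice"); // B sees A's update via the store
```

With server-local state the second instance starts counting from scratch, which is exactly why stateful services need affinity. A real shared store would also need an atomic increment rather than the read-then-write used in this sketch.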
If statelessness is so advantageous, why would anyone choose statefulness? Because certain requirements are fundamentally better served—or only achievable—with stateful architectures.
Use Case 1: Real-Time Bidirectional Communication
WebSocket connections, server-sent events, and gRPC streaming all maintain long-lived connections with client-specific state:
```typescript
// WebSocket connections are inherently stateful
class ChatServer {
  // State: Map of active connections per user
  // This state lives in THIS server instance's memory
  private connections = new Map<string, WebSocket[]>();

  // State: Per-connection metadata
  private connectionMeta = new WeakMap<WebSocket, ConnectionMetadata>();

  handleConnection(ws: WebSocket, userId: string) {
    // Store connection state locally
    if (!this.connections.has(userId)) {
      this.connections.set(userId, []);
    }
    this.connections.get(userId)!.push(ws);

    // Store metadata about this specific connection
    this.connectionMeta.set(ws, {
      userId,
      connectedAt: Date.now(),
      lastHeartbeat: Date.now(),
      messageCount: 0 // Accumulating state across messages
    });
  }

  sendToUser(userId: string, message: Message) {
    // Can only send to users connected to THIS server
    // This is the fundamental limitation of stateful services
    const userConnections = this.connections.get(userId);
    if (userConnections) {
      userConnections.forEach(ws => ws.send(JSON.stringify(message)));
    }
    // If user is connected to a DIFFERENT server, we can't reach them
    // This requires additional coordination (pub/sub, etc.)
  }
}
```

Use Case 2: In-Memory Computation and Caching
Some systems require holding large datasets or computation state in memory for performance:
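One common shape of this is a bounded in-memory cache. The sketch below is a minimal least-recently-used cache, assuming a single-process service; the `LruCache` name and the capacity of 2 are illustrative only.

```typescript
// Minimal LRU cache built on Map, which iterates keys in insertion order.
class LruCache<K, V> {
  private entries = new Map<K, V>();
  private capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  get size(): number {
    return this.entries.size;
  }
}

const cache = new LruCache<string, number>(2);
cache.set("a", 1);
cache.set("b", 2);
cache.get("a");    // "a" is now the most recently used
cache.set("c", 3); // over capacity: evicts "b"
```

Everything in this cache disappears when the process exits, so it only suits data that can be rebuilt or reloaded from a source of truth.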
Use Case 3: Workflow and Saga Orchestration
Long-running workflows often maintain state about multi-step processes:
```typescript
// Workflow orchestrator maintaining execution state
class OrderFulfillmentOrchestrator {
  // State: Active workflow executions
  private activeWorkflows = new Map<string, WorkflowState>();

  async executeWorkflow(orderId: string) {
    // Create and track workflow state
    const state: WorkflowState = {
      orderId,
      step: 'payment',
      startedAt: Date.now(),
      compensations: [], // For rollback if needed
      retryCount: 0
    };
    this.activeWorkflows.set(orderId, state);

    try {
      // Step 1: Payment
      state.step = 'payment';
      const paymentResult = await this.processPayment(orderId);
      state.compensations.push(() => this.refundPayment(paymentResult.id));

      // Step 2: Inventory
      state.step = 'inventory';
      const reservationResult = await this.reserveInventory(orderId);
      state.compensations.push(() => this.releaseInventory(reservationResult.id));

      // Step 3: Shipping
      state.step = 'shipping';
      await this.initiateShipping(orderId);

      // Success
      state.step = 'completed';
    } catch (error) {
      // Saga compensation - rollback in reverse order
      // Relies on state accumulated during execution
      for (const compensation of state.compensations.reverse()) {
        await compensation();
      }
      state.step = 'rolled_back';
    }
  }

  getWorkflowStatus(orderId: string): WorkflowState | undefined {
    // Can only check workflows running on THIS server
    return this.activeWorkflows.get(orderId);
  }
}
```

All legitimate statefulness use cases share a pattern: the cost of not having local state (latency, complexity, or impossibility) exceeds the cost of managing local state (scaling complexity, failure handling). When externalizing state would make the system unusable or impossible, statefulness is the correct choice.
Not all server state is equal. Understanding the different types of state helps you make better decisions about what state to keep locally and how to manage it.
1. Connection State
The most common form of server state in modern systems. Each active connection has associated state:
| Protocol | Connection State Includes | Lifetime |
|---|---|---|
| WebSocket | Socket handle, authentication context, subscriptions, pending messages | Session duration (minutes to hours) |
| HTTP/2 | Stream multiplexing state, flow control windows, header compression context | Connection duration (seconds to minutes) |
| gRPC Streaming | Stream state, buffered messages, cancellation tokens | Stream duration (variable) |
| Database Connection | Transaction state, prepared statements, session variables | Request or connection pool lifetime |
2. Session State
Information about an ongoing user session that spans multiple requests:
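As a rough illustration, session state held on the server might look like the sketch below. The `InMemorySessionStore` class, the `SessionData` fields, and the 30-second TTL are all assumptions made for the example; the clock is injected so expiry can be shown without waiting.

```typescript
interface SessionData {
  userId: string;
  cart: string[];
  lastSeen: number;
}

class InMemorySessionStore {
  private sessions = new Map<string, SessionData>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the live session for this ID, creating a fresh one
  // if none exists or the previous one has expired.
  touch(sessionId: string, userId: string): SessionData {
    const t = this.now();
    const existing = this.sessions.get(sessionId);
    if (existing && t - existing.lastSeen <= this.ttlMs) {
      existing.lastSeen = t;
      return existing;
    }
    const fresh: SessionData = { userId, cart: [], lastSeen: t };
    this.sessions.set(sessionId, fresh);
    return fresh;
  }

  addToCart(sessionId: string, userId: string, item: string): string[] {
    const session = this.touch(sessionId, userId);
    session.cart.push(item);
    return session.cart;
  }
}

let clock = 0;
const sessions = new InMemorySessionStore(30_000, () => clock);
sessions.addToCart("s1", "alice", "book");
clock = 10_000;
const cartAfterSecondRequest = sessions.addToCart("s1", "alice", "pen");
clock = 100_000; // well past the TTL since the last request
const cartAfterExpiry = sessions.addToCart("s1", "alice", "mug");
```

The second request sees the item added by the first because both reached the same store instance. A request routed to a different server would find an empty cart.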
3. Computation State
State that represents ongoing or cached computations:
```typescript
// Examples of computation state in stateful services

class StreamingAggregator {
  // State: Sliding window for real-time metrics
  private windows = new Map<string, SlidingWindow>();

  // This state is the computation - it cannot be externalized
  // without destroying the real-time property
  recordEvent(metricId: string, value: number, timestamp: number) {
    const window = this.windows.get(metricId) ?? new SlidingWindow(60_000);
    window.add(value, timestamp);
    this.windows.set(metricId, window);
  }

  getAverage(metricId: string): number {
    return this.windows.get(metricId)?.average() ?? 0;
  }
}

class GamePhysicsEngine {
  // State: Entity positions, velocities, collision state
  private entities = new Map<string, PhysicsEntity>();

  // State: Spatial partitioning structure for collision detection
  private spatialHash = new SpatialHashGrid(100);

  // This state represents the simulation - must be in memory
  tick(deltaTime: number) {
    for (const entity of this.entities.values()) {
      // Update positions
      entity.position.x += entity.velocity.x * deltaTime;
      entity.position.y += entity.velocity.y * deltaTime;

      // Update spatial hash
      this.spatialHash.update(entity.id, entity.position);
    }

    // Collision detection using spatial hash
    this.resolveCollisions();
  }
}

class MLModelServer {
  // State: Large neural network weights loaded in memory
  // (assigned in initialize(), hence the definite-assignment assertion)
  private model!: NeuralNetwork;

  // State: Batch buffer for efficient inference
  private batchBuffer: InferenceRequest[] = [];
  private batchTimer: NodeJS.Timeout | null = null;

  // Loading model from disk takes seconds - must stay in memory
  async initialize() {
    this.model = await loadModel('models/recommendation-v3.onnx');
    // 500MB+ model now in server memory
  }

  async infer(input: Tensor): Promise<Tensor> {
    return this.model.forward(input);
  }
}
```

4. Actor/Entity State
In actor-model systems (Akka, Orleans, Dapr), actors encapsulate state:
Actor models (Microsoft Orleans, Akka) embrace statefulness deliberately. Each actor holds its state in memory, and the framework handles actor placement, migration, and persistence. This is 'managed statefulness'—the complexity is handled by infrastructure rather than application code.
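The core idea can be sketched in a few lines: an actor's state is private and changes only in response to messages, and a registry activates actors on first use. The `CounterActor` and `ActorRegistry` names below are hypothetical, and this toy runs in a single process; it does not reflect the API of Orleans, Akka, or Dapr.

```typescript
// Messages are the only way to change an actor's state.
type CounterMessage = { kind: "add"; amount: number } | { kind: "reset" };

class CounterActor {
  private total = 0; // state owned exclusively by this actor

  receive(message: CounterMessage): number {
    if (message.kind === "add") {
      this.total += message.amount;
    } else {
      this.total = 0;
    }
    return this.total;
  }
}

// A toy registry that activates actors on first use, loosely in the
// spirit of virtual actors. Real frameworks also handle placement
// across servers, persistence, and deactivation.
class ActorRegistry {
  private active = new Map<string, CounterActor>();

  send(actorId: string, message: CounterMessage): number {
    let actor = this.active.get(actorId);
    if (!actor) {
      actor = new CounterActor();
      this.active.set(actorId, actor);
    }
    return actor.receive(message);
  }

  get activeCount(): number {
    return this.active.size;
  }
}

const registry = new ActorRegistry();
registry.send("player-1", { kind: "add", amount: 5 });
registry.send("player-2", { kind: "add", amount: 1 });
const player1Total = registry.send("player-1", { kind: "add", amount: 2 });
```

Callers address an actor by ID and never touch its state directly, which is what lets a framework move or reactivate the actor without the caller noticing.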
The defining operational characteristic of stateful services is session affinity (also called "sticky sessions"). Clients must be consistently routed to the server instance holding their state.
How Session Affinity Works:
Load balancers implement affinity through various mechanisms:
| Mechanism | How It Works | Pros | Cons |
|---|---|---|---|
| Cookie-based | LB sets/reads cookie with server ID | Works through NAT, browser handles persistence | Cookie size limits, requires cookie support |
| IP Hash | Hash of client IP determines server | No cookies needed, works for any protocol | Breaks with NAT/proxy, mobile IP changes |
| Header-based | Custom header indicates target server | Flexible, controllable by application | Requires application cooperation |
| Connection-based | Same TCP connection stays on same server | Perfect for WebSocket/streams | New connections may hit different server |
| Consistent Hashing | Hash of session ID to server ring | Handles server changes gracefully | More complex implementation |
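Of these mechanisms, consistent hashing is the least obvious, so here is a minimal sketch of a hash ring with virtual nodes. The `HashRing` class, the FNV-1a hash, the 100 points per server, and the server names are all illustrative choices rather than what any particular load balancer does internally.

```typescript
// FNV-1a 32-bit hash; any well-distributed hash would do.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

class HashRing {
  // Sorted list of (point, server) pairs; each server owns several points.
  private ring: { point: number; server: string }[] = [];
  private replicas: number;

  constructor(servers: string[], replicas = 100) {
    this.replicas = replicas;
    for (const server of servers) this.add(server);
  }

  add(server: string): void {
    for (let i = 0; i < this.replicas; i++) {
      this.ring.push({ point: fnv1a(`${server}#${i}`), server });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  remove(server: string): void {
    this.ring = this.ring.filter((entry) => entry.server !== server);
  }

  // Binary search for the first point at or after the key's hash,
  // wrapping around to the start of the ring if necessary.
  lookup(sessionId: string): string {
    if (this.ring.length === 0) throw new Error("no servers in ring");
    const h = fnv1a(sessionId);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].point < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].server;
  }
}

const ring = new HashRing(["backend1", "backend2", "backend3"]);
const sessionIds = Array.from({ length: 200 }, (_, i) => `session-${i}`);
const before = new Map<string, string>();
for (const id of sessionIds) before.set(id, ring.lookup(id));

ring.remove("backend2");
// Sessions that were NOT on backend2 should keep their original server.
const moved = sessionIds.filter(
  (id) => before.get(id) !== "backend2" && ring.lookup(id) !== before.get(id)
);
```

Removing a server only reassigns the sessions that server owned, so `moved` stays empty. With a plain `hash % serverCount` scheme, most sessions would change servers and lose their local state.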
```nginx
# NGINX configuration for session affinity

# Method 1: IP Hash - Simple but imperfect
upstream backend_ip_hash {
    ip_hash;
    server backend1.example.com:8080;
    server backend2.example.com:8080;
    server backend3.example.com:8080;
}

# Method 2: Cookie-based sticky sessions (NGINX Plus)
upstream backend_sticky {
    # Creates/reads cookie named 'srv_id' with 1-hour expiry
    sticky cookie srv_id expires=1h domain=.example.com path=/;
    server backend1.example.com:8080;
    server backend2.example.com:8080;
    server backend3.example.com:8080;
}

# Method 3: Consistent hashing on a header/variable
upstream backend_consistent {
    hash $cookie_session_id consistent;
    server backend1.example.com:8080;
    server backend2.example.com:8080;
    server backend3.example.com:8080;
}

server {
    listen 80;

    location /api/ {
        proxy_pass http://backend_sticky;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # WebSocket support for stateful connections
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

The Session Affinity Problem: Server Failure
The critical weakness of session affinity is server failure handling. When a server holding state dies:
Every stateful system must answer: 'What happens when a server holding state fails?' The answer—replication, persistence, graceful degradation, or accepting loss—defines your reliability story. This 'reliability tax' is unavoidable with statefulness.
To achieve reliability in stateful systems, state must be replicated or persisted. Several strategies exist, each with different trade-offs.
Strategy 1: Active-Passive Replication
One server (primary) holds active state; others (secondaries) receive replicated copies:
```typescript
// Active-Passive replication pattern
class ReplicatedSessionStore {
  private localState = new Map<string, SessionData>();
  private role: 'primary' | 'secondary';
  private peerNodes: PeerConnection[];

  async updateSession(sessionId: string, data: SessionData) {
    if (this.role === 'secondary') {
      throw new Error('Cannot update on secondary node');
    }

    // Update local state
    this.localState.set(sessionId, data);

    // Replicate to secondaries (sync or async)
    const replicationPromises = this.peerNodes.map(
      peer => peer.replicate(sessionId, data)
    );

    // Choice: Wait for replication (consistent but slower)
    // or continue immediately (faster but risk data loss)
    await Promise.all(replicationPromises); // Sync replication

    return data;
  }

  async promoteToLeader() {
    // Called when primary fails and this node takes over
    this.role = 'primary';
    // State is already local from replication
  }
}
```

Strategy 2: Multi-Primary / Conflict Resolution
All nodes can accept writes; conflicts are resolved via CRDTs or last-write-wins:
```typescript
// Multi-primary with CRDT for conflict resolution
class DistributedCounter {
  // CRDT: G-Counter (grow-only counter)
  // Each node tracks its own increments
  private nodeId: string;
  private counters = new Map<string, number>();

  constructor(nodeId: string) {
    this.nodeId = nodeId;
    this.counters.set(nodeId, 0);
  }

  increment() {
    // Only increment this node's counter
    const current = this.counters.get(this.nodeId) ?? 0;
    this.counters.set(this.nodeId, current + 1);
  }

  // Merge function - CRDT magic
  // No conflict possible: always take max
  merge(other: Map<string, number>) {
    for (const [nodeId, count] of other) {
      const myCount = this.counters.get(nodeId) ?? 0;
      this.counters.set(nodeId, Math.max(myCount, count));
    }
  }

  // Read total across all nodes
  value(): number {
    let total = 0;
    for (const count of this.counters.values()) {
      total += count;
    }
    return total;
  }
}
```

Strategy 3: Periodic Checkpointing
State is periodically written to durable storage for recovery:
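A minimal sketch of checkpoint-and-recover follows, assuming counters as the state and a JSON snapshot as the checkpoint format. `DurableStore` is a hypothetical in-process stand-in for a file, object store, or database, and `checkpoint()` is called manually here where a real service would run it on a timer.

```typescript
// Hypothetical stand-in for durable storage.
class DurableStore {
  private blob: string | null = null;
  write(serialized: string): void {
    this.blob = serialized;
  }
  read(): string | null {
    return this.blob;
  }
}

class CheckpointedCounters {
  private counts = new Map<string, number>();
  private store: DurableStore;

  constructor(store: DurableStore) {
    this.store = store;
    // On startup, recover whatever the last checkpoint captured.
    const saved = store.read();
    if (saved !== null) {
      const entries = JSON.parse(saved) as [string, number][];
      this.counts = new Map(entries);
    }
  }

  increment(key: string): void {
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  get(key: string): number {
    return this.counts.get(key) ?? 0;
  }

  // Snapshot the full in-memory state to durable storage.
  checkpoint(): void {
    this.store.write(JSON.stringify(Array.from(this.counts.entries())));
  }
}

const durable = new DurableStore();
const first = new CheckpointedCounters(durable);
first.increment("clicks");
first.increment("clicks");
first.checkpoint();
first.increment("clicks"); // happens after the checkpoint, so it is not durable

// Simulate a crash: a replacement instance starts from storage.
const recovered = new CheckpointedCounters(durable);
```

The replacement instance sees only what the last checkpoint captured, so the third increment is lost. The checkpoint interval therefore bounds how much work a failure can discard.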
| Strategy | Consistency | Availability | Complexity | Best For |
|---|---|---|---|---|
| Active-Passive (Sync) | Strong | Lower during failover | Medium | Financial, order processing |
| Active-Passive (Async) | Eventual (risk of loss) | High | Medium | Gaming, sessions |
| Multi-Primary CRDT | Eventual (convergent) | High | High | Counters, sets, distributed data |
| Checkpointing | Point-in-time | High (between checkpoints) | Low | Analytics, batch processing |
| In-Memory Grid | Configurable | High | Medium | Caching, session stores |
The right replication strategy depends on your consistency requirements and tolerance for data loss. Financial systems demand synchronous replication. Gaming systems might accept occasional state loss for lower latency. Know your domain's requirements.
Let's examine how leading technology companies implement stateful architectures at scale.
Discord: WebSocket Servers at Scale
Discord handles millions of concurrent WebSocket connections—a fundamentally stateful workload:
| Company | Stateful Component | Scale | State Management |
|---|---|---|---|
| Discord | WebSocket gateways | Millions of connections | Pub/sub for cross-server events, session resumption |
| Slack | Real-time messaging servers | Millions of connected clients | Channel-based sharding, presence state |
| Epic Games (Fortnite) | Game servers | 100 players/server | In-memory game state, eventual storage |
| Twitch | Chat servers | Hundreds of thousands/channel | Connection state, IRC-style fanout |
| Microsoft Orleans | Actor-based services | Millions of actors | Virtual actors, automatic persistence |
Epic Games: Fortnite Game Servers
Fortnite game servers are intensely stateful:
Microsoft Orleans: Virtual Actors
Orleans (used by Halo, Azure services) provides a framework for stateful services:
```csharp
// Orleans Virtual Actor (Grain) - Managed Statefulness
public interface IPlayerGrain : IGrainWithStringKey
{
    Task<PlayerState> GetState();
    Task UpdatePosition(Vector3 position);
    Task AddItem(Item item);
}

public class PlayerGrain : Grain, IPlayerGrain
{
    // STATE: Held in memory by this grain instance
    // Orleans handles activation and placement
    private PlayerState _state;

    // NOTE: ReadStateFromStorageAsync and WriteStateToStorageAsync are
    // illustrative placeholders. With Grain<TState>, Orleans exposes
    // ReadStateAsync() and WriteStateAsync() for this purpose.
    public override async Task OnActivateAsync()
    {
        // Load state from storage on first access
        _state = await ReadStateFromStorageAsync();
    }

    public Task<PlayerState> GetState() => Task.FromResult(_state);

    public Task UpdatePosition(Vector3 position)
    {
        // Update in-memory state only
        _state.Position = position;
        _state.LastUpdate = DateTime.UtcNow;

        // Nothing is written to storage here: persistence is explicit,
        // so this update survives a restart only after a later write
        return Task.CompletedTask;
    }

    public async Task AddItem(Item item)
    {
        _state.Inventory.Add(item);

        // Explicit write: state persists across server restarts
        await WriteStateToStorageAsync();
    }
}

// Usage - Orleans routes to correct server transparently
var player = grainFactory.GetGrain<IPlayerGrain>("player-12345");
await player.UpdatePosition(newPosition); // Routed to stateful grain
```

Frameworks like Orleans, Akka, and Dapr provide 'managed statefulness': you write stateful logic, and the framework handles placement, replication, persistence, and failover. This dramatically reduces the operational burden of stateful services.
Stateful services demand significantly more operational investment than stateless services. Understanding these challenges helps you prepare appropriately.
Challenge 1: Deployment Complexity
You can't simply roll out new instances and kill old ones. You must handle in-flight state:
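One way to handle this is connection draining: the instance stops accepting new connections, lets existing ones finish, and only then terminates. The sketch below models just that bookkeeping. `DrainableServer` and its method names are hypothetical, and real deployments would tie `accept` to a load balancer health check and add a drain timeout.

```typescript
type Phase = "accepting" | "draining";

class DrainableServer {
  private phase: Phase = "accepting";
  private connections = new Set<string>();

  // Returns false while draining, so the caller can route elsewhere.
  accept(connectionId: string): boolean {
    if (this.phase === "draining") return false;
    this.connections.add(connectionId);
    return true;
  }

  close(connectionId: string): void {
    this.connections.delete(connectionId);
  }

  // Stop taking new connections; existing ones continue until they close.
  beginDrain(): void {
    this.phase = "draining";
  }

  // Safe to terminate once draining has started and no connections remain.
  canTerminate(): boolean {
    return this.phase === "draining" && this.connections.size === 0;
  }
}

const server = new DrainableServer();
server.accept("c1");
server.accept("c2");
server.beginDrain();
const acceptedDuringDrain = server.accept("c3"); // rejected: server is draining
const readyTooEarly = server.canTerminate();     // c1 and c2 are still open
server.close("c1");
server.close("c2");
```

After both connections close, `canTerminate()` reports that the old instance can be shut down without cutting off any client mid-session.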
Challenge 2: Scaling Constraints
Horizontal scaling is possible but complex:
| Operation | Stateless | Stateful |
|---|---|---|
| Add instance | Immediate, no preparation | Must configure affinity, may need state preload |
| Remove instance | Terminate immediately | Drain connections, migrate state, then terminate |
| Auto-scale trigger | CPU/request count | Complex: connections, memory, state size |
| Scale-down risk | None | State loss if not handled, user disruption |
| Load balancing | Any algorithm works | Must use affinity-aware algorithms |
Challenge 3: Observability and Debugging
Debugging stateful systems is harder because behavior depends on accumulated state:
Stateful services require more sophisticated operations: custom deployment strategies, state migration tooling, connection draining logic, and state-aware monitoring. Only choose statefulness when the benefits outweigh this significant operational investment.
Most production systems use hybrid approaches—stateless services for general request handling with stateful components for specific needs.
Pattern 1: Stateless API Tier + Stateful Connection Tier
```typescript
// Hybrid Architecture Example

// STATELESS: API servers handle business logic
// No local state, horizontally scalable
class OrderApiServer {
  async createOrder(request: CreateOrderRequest): Promise<Order> {
    // All state in databases/caches
    const order = await database.orders.create(request);

    // Notify stateful tier about new order
    await pubsub.publish('order-created', order);

    return order;
  }
}

// STATEFUL: WebSocket servers handle real-time connections
// Holds connection state, requires affinity
class RealtimeServer {
  private connections = new Map<string, WebSocket>();

  constructor() {
    // Subscribe to events from stateless tier
    pubsub.subscribe('order-created', (order) => {
      // Find connected users who should see this order
      const userId = order.userId;
      const ws = this.connections.get(userId);
      if (ws) {
        ws.send(JSON.stringify({ type: 'order-created', order }));
      }
    });
  }

  handleConnection(ws: WebSocket, userId: string) {
    this.connections.set(userId, ws);
    ws.on('close', () => this.connections.delete(userId));
  }
}

// ARCHITECTURE:
// [Clients] --> [Load Balancer] --> [Stateless API Servers] --> [Database]
//     |                                      |
//     v                                  [Pub/Sub]
// [Stateful WS Servers] <--------------------+
```

Pattern 2: Stateless with Distributed Cache
Services are stateless, but frequently accessed state is cached locally with eventual consistency:
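A minimal sketch of this pattern is a read-through local cache with a time-to-live. The `LocalTtlCache` class, the 5-second TTL, and the `source` map standing in for the shared store are assumptions for illustration; the clock is injected so staleness can be shown deterministically.

```typescript
class LocalTtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private load: (key: string) => V;
  private now: () => number;

  constructor(ttlMs: number, load: (key: string) => V, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.load = load;
    this.now = now;
  }

  // Serve from local memory while the entry is fresh;
  // otherwise reload from the source of truth.
  get(key: string): V {
    const t = this.now();
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > t) return hit.value;
    const value = this.load(key);
    this.entries.set(key, { value, expiresAt: t + this.ttlMs });
    return value;
  }
}

// Stand-in for the shared source of truth (database or distributed cache).
const source = new Map<string, string>([["plan:alice", "free"]]);
let loads = 0;
let time = 0;
const planCache = new LocalTtlCache<string>(
  5_000,
  (key) => {
    loads++;
    return source.get(key) ?? "unknown";
  },
  () => time
);

const firstRead = planCache.get("plan:alice");  // loads from the source
source.set("plan:alice", "pro");                // source changes
const staleRead = planCache.get("plan:alice");  // still served from local cache
time = 6_000;                                   // TTL has elapsed
const freshRead = planCache.get("plan:alice");  // reloaded from the source
```

Losing this cache costs only a reload, so the service stays operationally stateless. The price is a staleness window of up to one TTL, which is the eventual consistency the pattern accepts.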
Most successful systems are hybrids. Stateless for the bulk of processing, stateful only where truly required. This minimizes operational complexity while still supporting real-time features, in-memory computation, or other stateful needs.
We've explored stateful architecture comprehensively—when it's needed, how it works, and the operational considerations it demands. Let's consolidate the key insights:
What's next:
Now that we understand both stateless and stateful architectures, we'll explore the implications for scaling. How do these architectural choices affect your ability to handle growth? What constraints emerge as you scale each type of service? The next page dives deep into the scaling calculus of stateless vs stateful systems.
You now understand stateful architecture thoroughly—its definition, legitimate use cases, state types, session affinity requirements, replication strategies, real-world implementations, and operational challenges. Combined with the previous page on statelessness, you can now reason clearly about where state should live in your systems.