BASE Properties: The Alternative to ACID

Level: Intermediate

Duration: 75 mins

Topic: BASE Properties

Soft State

The Illusion of Permanence

In traditional database thinking, data written to a database stays exactly as written until explicitly modified. A customer's address, once stored, remains unchanged until the customer submits an update form. An account balance, once recorded, sits frozen until a transaction moves funds. This is the model of hard state—data is permanent, durable, and static.

Distributed systems operating under the BASE model work fundamentally differently. Here, state is soft—data may change without explicit input from users or applications. Your view of the data right now might differ from your view a moment from now, even if no one 'touched' it.

This isn't a bug. It's a feature. And understanding soft state is essential for designing systems that scale to planetary proportions.

What You Will Learn

By the end of this page, you will understand what soft state means in distributed systems, why it's a necessary consequence of choosing availability over strong consistency, the mechanisms that cause state to change 'on its own,' and how to design applications that embrace soft state rather than being broken by it.

Defining Soft State

Soft state means that the state of a system may change over time, even without input. This change typically occurs due to eventual consistency mechanisms that propagate updates across replicas, resolve conflicts, and converge toward a consistent view.

To understand soft state, contrast it with hard state:

Hard State vs. Soft State
Characteristic	Hard State (ACID)	Soft State (BASE)
Stability after write	Once written, data stays as written until explicitly modified	Visible data may be updated by background processes
Consistency	Immediately consistent across all replicas	Temporarily inconsistent across replicas
Predictability	Read always returns exactly what was last written	Read may return different values at different times
Time dependency	State is independent of when you read it	State depends on when you read it
User control	Only user actions modify data	System processes may modify visible data

Soft Doesn't Mean Unreliable

The term 'soft' might suggest fragility or lack of durability, but that's not the case. Soft state systems can be highly durable and reliable. 'Soft' refers to the mutability of the observable state over time—not to the underlying storage reliability. Your data is safe; it's just that different parts of the system might temporarily have different views of it.

Why State Becomes 'Soft':

Soft state is a direct consequence of two design decisions that enable basic availability:

1. Asynchronous Replication: Updates to one replica don't immediately appear on other replicas. The system propagates changes over time.
2. Conflict Resolution: When the same data is modified on multiple replicas (e.g., during a network partition), the system must resolve these conflicts, potentially changing the 'final' value from what any single write requested.

These mechanisms mean that when you read data, you might see:

•The value before a recent write (replication delay)
•A different value than another reader seeing the same data from a different replica
•A merged/resolved value that differs from any individual write

The state is 'soft' because it's in flux—constantly moving toward consistency but never guaranteed to be fully consistent at any given moment.
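
The effect of asynchronous replication can be reproduced with a small in-memory model. The sketch below is illustrative only: `Primary`, `Follower`, and `replicate()` are hypothetical names standing in for a real replication stream, and the "background" step is triggered by hand so the behavior is deterministic.

```typescript
// Hypothetical in-memory model; not a real database API.
type Update = { key: string; value: string };

class Follower {
  private data = new Map<string, string>();

  apply(update: Update): void {
    this.data.set(update.key, update.value);
  }

  read(key: string): string | undefined {
    return this.data.get(key);
  }
}

class Primary {
  private data = new Map<string, string>();
  // Updates acknowledged to the client but not yet shipped to the follower.
  private pending: Update[] = [];
  private follower: Follower;

  constructor(follower: Follower) {
    this.follower = follower;
  }

  // The write returns as soon as the primary has it; replication happens later.
  write(key: string, value: string): void {
    this.data.set(key, value);
    this.pending.push({ key, value });
  }

  read(key: string): string | undefined {
    return this.data.get(key);
  }

  // Stand-in for the background replication stream: ship everything queued.
  replicate(): number {
    const shipped = this.pending.length;
    for (const update of this.pending) {
      this.follower.apply(update);
    }
    this.pending = [];
    return shipped;
  }
}

const follower = new Follower();
const primary = new Primary(follower);

primary.write("user:123:email", "old@email.com");
primary.replicate();

primary.write("user:123:email", "new@email.com");
const beforeSync = follower.read("user:123:email"); // follower still has the old value
primary.replicate(); // background step, no client write involved
const afterSync = follower.read("user:123:email"); // follower now has the new value

console.log({ beforeSync, afterSync });
```

Between the two follower reads no client issued a write, yet the value a reader would see changed. That gap is the window in which two readers on different replicas can disagree.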

Mechanisms That Create Soft State

Several technical mechanisms in distributed systems contribute to soft state. Understanding these mechanisms is crucial for predicting how your system will behave and designing applications that work correctly despite state softness.

Mechanisms Creating Soft State

•Replication Lag — The time between a write completing on one replica and becoming visible on other replicas. Can range from milliseconds to seconds (or longer during failures).
•Anti-Entropy Processes — Background processes that compare replicas and propagate differences. They run periodically, so a replica's data can change each time a run completes.
•Read Repair — When a read detects inconsistency between replicas, it triggers an update. A read can cause a write!
•Merkle Trees and Hash Comparisons — Efficient mechanisms for detecting differences between replicas, triggering synchronization that changes observable state.
•Conflict Resolution — When concurrent writes conflict, a resolution strategy (LWW, a CRDT merge, or application logic applied to conflicts detected with vector clocks) determines the final value, which may differ from what any individual write requested.
•TTL and Expiration — Data with time-to-live settings automatically changes (disappears) when TTL expires, without user action.
•Compaction and Garbage Collection — Background processes that clean up data, potentially changing what's observable.

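// Demonstration of soft state through replication lag

// Helper used by the examples on this page: resolves after `ms` milliseconds
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));
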
interface Replica {
  id: string;
  write(key: string, value: any, timestamp: number): Promise<void>;
  read(key: string): Promise<{ value: any; timestamp: number } | null>;
}
 
class SoftStateDemo {
  private replicas: Replica[];
  
  constructor(replicas: Replica[]) {
    this.replicas = replicas;
  }
  
  async demonstrateSoftState() {
    const key = 'user:123:email';
    
    // Write to primary replica
    await this.replicas[0].write(key, 'new@email.com', Date.now());
    console.log('Write completed on primary replica');
    
    // Immediately read from all replicas
    console.log('\nReading immediately after write:');
    for (const replica of this.replicas) {
      const result = await replica.read(key);
      console.log(`  Replica ${replica.id}: ${result?.value}`);
    }
    // Output might be:
    // Replica primary: new@email.com
    // Replica secondary-1: old@email.com    <-- SOFT STATE!
    // Replica secondary-2: old@email.com    <-- SOFT STATE!
    
    // Wait for replication
    await sleep(1000);
    
    // Read again - state has "changed" without new writes
    console.log('\nReading after replication delay:');
    for (const replica of this.replicas) {
      const result = await replica.read(key);
      console.log(`  Replica ${replica.id}: ${result?.value}`);
    }
    // Output:
    // Replica primary: new@email.com
    // Replica secondary-1: new@email.com    <-- State changed!
    // Replica secondary-2: new@email.com    <-- State changed!
  }
}
 
// The observable state changed between reads,
// even though no new writes occurred.
// This is soft state in action.
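
Anti-entropy and hash comparison, listed among the mechanisms above, can be sketched in the same spirit. The example below is a simplification under stated assumptions: it uses a flat list of bucket digests rather than a real Merkle tree, a small FNV-1a hash instead of a cryptographic one, and last-writer-wins to reconcile differing entries. All names (`Store`, `digests`, `antiEntropy`) are hypothetical.

```typescript
type Versioned = { value: string; timestamp: number };
type Store = Map<string, Versioned>;

// FNV-1a, a small non-cryptographic hash; adequate for a sketch.
function fnv1a(text: string): number {
  let hash = 2166136261;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 16777619);
  }
  return hash >>> 0;
}

const BUCKETS = 4;
const bucketOf = (key: string): number => fnv1a(key) % BUCKETS;

// One digest per bucket, computed over the sorted entries in that bucket.
function digests(store: Store): number[] {
  const parts: string[][] = [];
  for (let i = 0; i < BUCKETS; i++) parts.push([]);
  store.forEach((v, key) => {
    parts[bucketOf(key)].push(`${key}=${v.value}@${v.timestamp}`);
  });
  return parts.map((entries) => fnv1a(entries.sort().join("|")));
}

// Copy every entry in `bucket` that is missing or older on the target.
function pull(target: Store, source: Store, bucket: number): number {
  let changed = 0;
  source.forEach((theirs, key) => {
    if (bucketOf(key) !== bucket) return;
    const mine = target.get(key);
    if (!mine || mine.timestamp < theirs.timestamp) {
      target.set(key, theirs);
      changed++;
    }
  });
  return changed;
}

// One anti-entropy round: only buckets whose digests differ are exchanged.
function antiEntropy(a: Store, b: Store): number {
  const da = digests(a);
  const db = digests(b);
  let changed = 0;
  for (let bucket = 0; bucket < BUCKETS; bucket++) {
    if (da[bucket] === db[bucket]) continue;
    changed += pull(a, b, bucket) + pull(b, a, bucket);
  }
  return changed;
}

const replicaA: Store = new Map<string, Versioned>([
  ["cart:1", { value: "2 items", timestamp: 10 }],
  ["cart:2", { value: "empty", timestamp: 5 }],
]);
const replicaB: Store = new Map<string, Versioned>([
  ["cart:1", { value: "3 items", timestamp: 12 }],
]);

const repaired = antiEntropy(replicaA, replicaB);
console.log(repaired, replicaA.get("cart:1"), replicaB.get("cart:2"));
```

After the round, replica A's `cart:1` has changed and replica B has gained `cart:2`, although no client wrote to either. A second round finds matching digests and changes nothing.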

Read Repair: Reads That Write

One of the most counterintuitive aspects of soft state is read repair. In many distributed databases, when a read operation detects inconsistency between replicas, it triggers a repair—writing the correct value to out-of-date replicas.

This means:

1. You issue a read-only query
2. The system detects that replicas disagree
3. The system writes to some replicas to make them consistent
4. Your read returns

From the perspective of other readers, data 'changed' as a side effect of your read. This is soft state in action.

// Read repair in a quorum-based system
 
async function readWithRepair(
  key: string,
  replicas: Replica[],
  readQuorum: number
): Promise<any> {
  // Read from all replicas in parallel
  const responses = await Promise.all(
    replicas.map(async (replica) => {
      try {
        const result = await replica.read(key);
        return { replica, result, success: true };
      } catch (error) {
        return { replica, result: null, success: false };
      }
    })
  );
  
  // Any replica that answered counts toward the quorum, even if it has no value yet
  const responded = responses.filter(r => r.success);
  
  if (responded.length < readQuorum) {
    throw new Error('Read quorum not achieved');
  }
  
  const withValue = responded.filter(r => r.result !== null);
  if (withValue.length === 0) {
    return null; // No replica holds this key
  }
  
  // Find the most recent value (highest timestamp wins)
  const latest = withValue.reduce((a, b) => 
    (a.result!.timestamp >= b.result!.timestamp) ? a : b
  ).result!;
  
  // A replica is stale if it is missing the key or holds an older version
  const staleReplicas = responded.filter(
    r => r.result === null || r.result.timestamp < latest.timestamp
  );
  
  // Trigger repair in background (don't await)
  if (staleReplicas.length > 0) {
    triggerReadRepair(key, latest, staleReplicas);
    console.log(`Read repair triggered for ${staleReplicas.length} replicas`);
    // Note: This read operation just caused writes!
    // Other readers will see "changed" data as a result.
  }
  
  return latest.value;
}
 
async function triggerReadRepair(
  key: string, 
  correctValue: { value: any; timestamp: number },
  staleReplicas: { replica: Replica }[]
) {
  // Background repair - updates stale replicas
  for (const { replica } of staleReplicas) {
    replica.write(key, correctValue.value, correctValue.timestamp)
      .catch(err => console.error(`Repair failed for ${replica.id}`, err));
  }
}

Implications for Application Design

Soft state has profound implications for how we design applications. Code that assumes hard state—that data won't change unless explicitly modified—will break in subtle and difficult-to-debug ways. Here's how to design for soft state.

Design Principles for Soft State

•Never cache derived state indefinitely — Data fetched and processed might become stale. Implement TTL-based expiration for all caches.
•Design for idempotency — Operations may be retried due to uncertainty about state. Ensure repeated operations have the same effect as a single operation.
•Use version numbers or timestamps — Track data freshness explicitly. Let clients know how old data is and decide if it's acceptable.
•Implement optimistic concurrency — Check that data hasn't changed before applying updates. Use version vectors or timestamps to detect conflicts.
•Design for conflict resolution — Don't assume writes will 'just work.' Plan for conflicts and implement resolution strategies.
•Embrace eventual consistency in UI — Show users that data may be updating. Use loading states, timestamps, and refresh mechanisms.
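
The idempotency principle is easiest to see with a retry. The following is a minimal sketch, assuming the client attaches a unique request ID (an idempotency key) to each logical operation; `Account` and `deposit` are hypothetical names, and a production system would also persist and expire the set of seen IDs.

```typescript
type DepositResult = { balance: number; applied: boolean };

class Account {
  private balance = 0;
  // Request IDs already processed, mapped to the balance they produced.
  private seen = new Map<string, number>();

  // requestId is a client-generated idempotency key (an assumption of this sketch).
  deposit(requestId: string, amount: number): DepositResult {
    const previous = this.seen.get(requestId);
    if (previous !== undefined) {
      // Retry of a request already handled: replay the outcome, change nothing.
      return { balance: previous, applied: false };
    }
    this.balance += amount;
    this.seen.set(requestId, this.balance);
    return { balance: this.balance, applied: true };
  }
}

const account = new Account();
const first = account.deposit("req-42", 50);
const retry = account.deposit("req-42", 50); // client timed out and retried
const other = account.deposit("req-43", 25);

console.log(first, retry, other);
```

The retry returns the same balance as the first call and does not deposit twice, so a client that is unsure whether its request succeeded can safely send it again.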

The Stale Data Trap

One of the most common bugs in soft-state systems is caching data locally and assuming it remains valid. A user's session, a product's inventory, a document's content—all of these can change between when you read them and when you use them. Always re-validate at critical decision points, especially before commits or transactions.
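
One way to re-validate at a decision point is to record the version of each item when it is read and compare it against the current version just before committing. The sketch below assumes each catalog entry carries a monotonically increasing `version`; `CartLine`, `Priced`, and `staleLines` are hypothetical names.

```typescript
type Priced = { price: number; version: number };
type CartLine = { sku: string; seenPrice: number; seenVersion: number };

// Returns the SKUs whose data changed (or disappeared) since the cart captured them.
function staleLines(cart: CartLine[], current: Map<string, Priced>): string[] {
  return cart
    .filter((line) => {
      const latest = current.get(line.sku);
      return latest === undefined || latest.version !== line.seenVersion;
    })
    .map((line) => line.sku);
}

const catalog = new Map<string, Priced>([
  ["sku-1", { price: 20, version: 3 }],
  ["sku-2", { price: 15, version: 8 }], // repriced after the user added it
]);

const cart: CartLine[] = [
  { sku: "sku-1", seenPrice: 20, seenVersion: 3 },
  { sku: "sku-2", seenPrice: 12, seenVersion: 7 },
  { sku: "sku-3", seenPrice: 5, seenVersion: 1 }, // no longer in the catalog
];

const changed = staleLines(cart, catalog);
console.log(changed);
```

If the result is non-empty, the checkout flow can refresh those lines and ask the user to confirm instead of committing with stale data.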

Anti-Pattern: Read-Modify-Write Without Versioning

Consider this common pattern that breaks with soft state:

1. Read current inventory: 100 units
2. User adds 10 units to cart
3. Calculate new inventory: 100 - 10 = 90
4. Write new inventory: 90

The problem: between steps 1 and 4, another process might have updated inventory. You might overwrite their change, or they might overwrite yours.

Correct Pattern: Conditional Update with Versioning

1. Read current inventory: 100 units, version: 5
2. User adds 10 units to cart
3. Calculate new inventory: 100 - 10 = 90
4. Write new inventory: 90, ONLY IF version still 5
5. If version changed, re-read and retry

This pattern respects soft state by acknowledging that the data might have changed and handling that case explicitly.

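// Handle soft state with optimistic concurrency control
// Assumes a `db` client exposing readWithVersion/writeIfVersion, a VersionConflictError
// thrown on a failed conditional write, and domain types such as InventoryRecord.
 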
interface VersionedData<T> {
  value: T;
  version: number;
  lastModified: Date;
}
 
async function updateWithOptimisticLocking<T>(
  key: string,
  updateFn: (current: T) => T,
  maxRetries: number = 3
): Promise<VersionedData<T>> {
  
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    // Read current state with version
    const current = await db.readWithVersion<T>(key);
    
    // Apply update function
    const newValue = updateFn(current.value);
    
    try {
      // Attempt conditional write
      const result = await db.writeIfVersion(
        key, 
        newValue, 
        current.version
      );
      
      return result; // Success!
      
    } catch (error) {
      if (error instanceof VersionConflictError) {
        // State changed between read and write (soft state!)
        console.log(`Version conflict on attempt ${attempt + 1}, retrying...`);
        
        // Add exponential backoff for high-contention scenarios
        await sleep(Math.pow(2, attempt) * 100);
        continue;
      }
      throw error; // Unexpected error
    }
  }
  
  throw new Error(`Failed to update ${key} after ${maxRetries} attempts`);
}
 
// Usage: Safely decrement inventory
async function reserveInventory(productId: string, quantity: number) {
  return await updateWithOptimisticLocking<InventoryRecord>(
    `inventory:${productId}`,
    (current) => {
      if (current.available < quantity) {
        throw new InsufficientInventoryError();
      }
      return {
        ...current,
        available: current.available - quantity,
        reserved: current.reserved + quantity
      };
    }
  );
}
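
To see the difference between the anti-pattern and the versioned pattern without a real database, the two can be run side by side against an in-memory store. This is a sketch under assumptions: `VersionedStore` is a hypothetical single-value store whose `writeIfVersion` returns `false` on conflict rather than throwing.

```typescript
class VersionedStore {
  private value: number;
  private version = 1;

  constructor(initial: number) {
    this.value = initial;
  }

  read(): { value: number; version: number } {
    return { value: this.value, version: this.version };
  }

  // Unconditional write: the last caller wins and earlier updates can be lost.
  write(value: number): void {
    this.value = value;
    this.version++;
  }

  // Conditional write: succeeds only if nobody wrote since `expected` was read.
  writeIfVersion(value: number, expected: number): boolean {
    if (this.version !== expected) return false;
    this.value = value;
    this.version++;
    return true;
  }
}

// Anti-pattern: both clients read 100, then both write blindly.
const blind = new VersionedStore(100);
const a1 = blind.read();
const b1 = blind.read();
blind.write(a1.value - 10);
blind.write(b1.value - 5); // overwrites client A's decrement
const blindResult = blind.read().value; // 95, although 85 was intended

// Versioned pattern: client B's first attempt is rejected, so it re-reads and retries.
const safe = new VersionedStore(100);
const a2 = safe.read();
const b2 = safe.read();
const aOk = safe.writeIfVersion(a2.value - 10, a2.version);
const bFirstAttempt = safe.writeIfVersion(b2.value - 5, b2.version);
let bOk = bFirstAttempt;
if (!bOk) {
  const fresh = safe.read();
  bOk = safe.writeIfVersion(fresh.value - 5, fresh.version);
}
const safeResult = safe.read().value; // 85

console.log({ blindResult, safeResult });
```

The blind writes lose one decrement, while the conditional write surfaces the conflict so the second client can recompute from fresh data.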

TTL and Expiring State

A common manifestation of soft state is time-to-live (TTL) expiration. Data stored with a TTL will automatically disappear when the time expires—a clear example of state changing without explicit modification.

TTLs are used extensively in distributed systems for:

•Session management: User sessions expire after inactivity
•Caching: Cached data expires to force refresh
•Rate limiting: Rate limit counters reset after time windows
•Distributed locks: Locks expire to prevent deadlocks
•Temporary data: One-time codes, verification tokens

Common TTL Patterns
Use Case	Typical TTL	Why It Works
Session tokens	15-30 minutes	Balance security (short) with UX (long enough)
API response cache	5-60 seconds	Reduce load while keeping data reasonably fresh
CDN cache	1 hour - 1 day	Edge cache benefits outweigh staleness cost
Rate limit windows	1 second - 1 hour	Match rate limit policy granularity
Distributed locks	30-60 seconds	Long enough for operation, short enough to recover
DNS cache	5 minutes - 24 hours	Reduce DNS lookups while allowing updates

// Redis-style operations with TTL
 
class DistributedCache {
  private redis: RedisClient;
  
  // Set with TTL - data will auto-expire (soft state!)
  async setWithTTL(
    key: string, 
    value: any, 
    ttlSeconds: number
  ): Promise<void> {
    await this.redis.setex(key, ttlSeconds, JSON.stringify(value));
  }
  
  // Get with TTL check - handle expiration gracefully
  async getWithFallback<T>(
    key: string,
    fallbackFn: () => Promise<T>,
    ttlSeconds: number
  ): Promise<T> {
    const cached = await this.redis.get(key);
    
    if (cached !== null) {
      return JSON.parse(cached) as T;
    }
    
    // Cache miss or expired - soft state changed!
    // Fetch fresh data and cache it
    const fresh = await fallbackFn();
    await this.setWithTTL(key, fresh, ttlSeconds);
    return fresh;
  }
  
  // Sliding window pattern - extend TTL on access
  async getWithSlidingExpiry<T>(
    key: string,
    ttlSeconds: number
  ): Promise<T | null> {
    const value = await this.redis.get(key);
    
    if (value !== null) {
      // Reset TTL on every access - keeps active data alive
      await this.redis.expire(key, ttlSeconds);
      return JSON.parse(value) as T;
    }
    
    return null;
  }
}
 
// Example: Session management with soft state
class SessionManager {
  private cache: DistributedCache;
  private readonly SESSION_TTL = 30 * 60; // 30 minutes
  
  async getSession(sessionId: string): Promise<Session | null> {
    // Session might expire (change to null) at any moment
    // This is soft state - the session "changes" when TTL expires
    const session = await this.cache.getWithSlidingExpiry<Session>(
      `session:${sessionId}`,
      this.SESSION_TTL
    );
    
    if (!session) {
      // Session expired or never existed
      // Application must handle this gracefully
      return null;
    }
    
    return session;
  }
}

TTL as a Feature, Not a Bug

TTL-based expiration elegantly handles many cleanup problems. Instead of building complex background jobs to delete old sessions, rate limit counters, or temporary data, let the TTL mechanism do it automatically. This is soft state working in your favor—embrace it.
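
As an illustration of cleanup by expiry, here is a sketch of a fixed-window rate limiter in which the counter simply lapses when its TTL passes. It is an in-memory stand-in, not a Redis client: `TtlCounter`, `allow`, and the injected clock are hypothetical, and the clock is passed in so the example runs deterministically.

```typescript
type Clock = () => number; // current time in milliseconds

class TtlCounter {
  private entries = new Map<string, { count: number; expiresAt: number }>();
  private now: Clock;

  constructor(now: Clock) {
    this.now = now;
  }

  // Increment `key`; the first increment in a window starts the TTL.
  incr(key: string, ttlMs: number): number {
    const t = this.now();
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= t) {
      // Missing or expired: the old count is simply gone.
      this.entries.set(key, { count: 1, expiresAt: t + ttlMs });
      return 1;
    }
    entry.count++;
    return entry.count;
  }
}

function allow(counter: TtlCounter, user: string, limit: number, windowMs: number): boolean {
  return counter.incr(`rate:${user}`, windowMs) <= limit;
}

let fakeTime = 0;
const counter = new TtlCounter(() => fakeTime);

const decisions: boolean[] = [];
decisions.push(allow(counter, "alice", 2, 1000)); // first request in the window
decisions.push(allow(counter, "alice", 2, 1000)); // second request, still allowed
decisions.push(allow(counter, "alice", 2, 1000)); // third request, over the limit
fakeTime = 1000;                                   // the window lapses; nothing deletes the counter
decisions.push(allow(counter, "alice", 2, 1000)); // allowed again in the new window

console.log(decisions);
```

No reset job is involved: the limiter's state changes from "blocked" to "allowed" purely because time passed.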

Conflict Resolution and Merging

When multiple replicas accept writes independently (during network partitions or in multi-leader/leaderless systems), conflicts occur. The resolution of these conflicts is another source of soft state—the 'final' value might differ from what any writer wrote.

Common Conflict Resolution Strategies

•Last-Writer-Wins (LWW) — The write with the latest timestamp wins. Simple but can lose data silently.
•First-Writer-Wins — The first write is preserved, later writes to the same key are ignored.
•Application-Level Resolution — Conflicts are stored, and application logic decides how to merge them.
•Conflict-Free Replicated Data Types (CRDTs) — Specially designed data structures that merge automatically without conflicts.
•Operational Transformation (OT) — Transform conflicting operations so they can be applied in any order.
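•Vector Clocks — Track causality so the system can tell concurrent writes from sequential ones. They detect conflicts but do not resolve them, so they are paired with one of the strategies above.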
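
To make the vector clock entry concrete, the comparison at its core can be sketched in a few lines. This assumes a clock is a map from node ID to a per-node counter, with missing entries treated as zero; `compare` and the `Ordering` labels are hypothetical names.

```typescript
type VectorClock = Record<string, number>;
type Ordering = "before" | "after" | "equal" | "concurrent";

// Compare two vector clocks entry by entry.
function compare(a: VectorClock, b: VectorClock): Ordering {
  let aAhead = false;
  let bAhead = false;
  // Duplicated node IDs are harmless here; each is just checked twice.
  const nodes = Object.keys(a).concat(Object.keys(b));
  for (const node of nodes) {
    const av = a[node] ?? 0;
    const bv = b[node] ?? 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent"; // neither write saw the other
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

const base: VectorClock = { n1: 1 };
const writeOnN1: VectorClock = { n1: 2 };        // n1 updated after seeing base
const writeOnN2: VectorClock = { n1: 1, n2: 1 }; // n2 updated after seeing base

console.log(compare(base, writeOnN1));      // "before"
console.log(compare(writeOnN1, writeOnN2)); // "concurrent"
```

A "concurrent" result is the signal that a real conflict exists and a resolution strategy has to be applied; "before" and "after" mean one write safely supersedes the other.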

Deep Dive: Last-Writer-Wins (LWW)

LWW is the simplest and one of the most widely used conflict resolution strategies. Each write includes a timestamp, and when replicas sync, the write with the highest timestamp wins.

Advantages:

•Simple to implement and understand
•Automatically converges to a single value
•No conflict storage or resolution logic needed

Disadvantages:

•Can silently lose data
•Depends on synchronized clocks (problematic in distributed systems)
•Doesn't capture user intent

Example Scenario:

Time T1: User A updates product name to "Widget Pro"
Time T2: User B updates product name to "Widget Plus" 
         (on a different replica, didn't see A's change)
Time T3: Replicas sync - "Widget Plus" wins (later timestamp)

Result: User A's change is silently lost.
User A sees "Widget Plus" and is confused.

This is soft state in action—User A wrote "Widget Pro" and later sees "Widget Plus" without having made any change.
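
The scenario above, and the clock dependency listed among the disadvantages, can be reproduced with a small merge function. This is a sketch with assumptions: timestamps are plain numbers, and ties are broken by node ID (an addition not described above) so that every replica picks the same winner.

```typescript
type Stamped = { value: string; timestamp: number; node: string };

// Keep the write with the higher timestamp; break ties by node ID.
function lwwMerge(a: Stamped, b: Stamped): Stamped {
  if (a.timestamp !== b.timestamp) return a.timestamp > b.timestamp ? a : b;
  return a.node > b.node ? a : b;
}

const fromUserA: Stamped = { value: "Widget Pro", timestamp: 1000, node: "replica-1" };
const fromUserB: Stamped = { value: "Widget Plus", timestamp: 1005, node: "replica-2" };

const merged = lwwMerge(fromUserA, fromUserB);        // User B's write wins
const mergedReverse = lwwMerge(fromUserB, fromUserA); // same winner in either order

// Clock skew: replica-1's clock runs 10 seconds fast, so a write that happened
// earlier in real time carries the larger timestamp and wins instead.
const skewed: Stamped = { value: "Widget Pro", timestamp: 1000 + 10000, node: "replica-1" };
const skewedWinner = lwwMerge(skewed, fromUserB);

console.log(merged.value, mergedReverse.value, skewedWinner.value);
```

The merge converges regardless of argument order, which is what makes LWW attractive, but the winner is decided entirely by the timestamps, so a skewed clock quietly changes which user's update survives.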

// G-Counter CRDT: A counter that only grows
// Multiple replicas can increment independently
// Always merges correctly without conflicts
 
class GCounter {
  private counts: Map<string, number>;
  private nodeId: string;
  
  constructor(nodeId: string) {
    this.nodeId = nodeId;
    this.counts = new Map();
  }
  
  // Increment only affects local node's count
  increment(): void {
    const current = this.counts.get(this.nodeId) || 0;
    this.counts.set(this.nodeId, current + 1);
  }
  
  // Get current value (sum of all nodes)
  value(): number {
    let total = 0;
    for (const count of this.counts.values()) {
      total += count;
    }
    return total;
  }
  
  // Merge with another replica - ALWAYS succeeds!
  merge(other: GCounter): void {
    for (const [nodeId, count] of other.counts.entries()) {
      const myCount = this.counts.get(nodeId) || 0;
      // Take maximum - this is mathematically guaranteed to converge
      this.counts.set(nodeId, Math.max(myCount, count));
    }
  }
  
  // Export state for replication
  toState(): Map<string, number> {
    return new Map(this.counts);
  }
}
 
// Usage example: Page view counter
// Works correctly even with network partitions
 
const nodeA = new GCounter('node-a');
const nodeB = new GCounter('node-b');
 
// Both nodes receive page views independently
nodeA.increment(); // View on node A
nodeA.increment(); // View on node A
nodeB.increment(); // View on node B
 
console.log('Before merge:');
console.log(`  Node A sees: ${nodeA.value()}`); // 2
console.log(`  Node B sees: ${nodeB.value()}`); // 1
 
// Network partition heals, nodes sync
nodeA.merge(nodeB);
nodeB.merge(nodeA);
 
console.log('After merge:');
console.log(`  Node A sees: ${nodeA.value()}`); // 3
console.log(`  Node B sees: ${nodeB.value()}`); // 3
 
// Perfect convergence! No conflicts, no data loss.
// This is soft state that "changes" but always correctly.

CRDTs: The Gold Standard for Soft State

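Conflict-Free Replicated Data Types are among the most predictable ways to manage soft state. For state-based CRDTs, the merge operation is commutative, associative, and idempotent, so replicas that have received the same set of updates converge to the same value regardless of delivery order. Riak's data types and Redis Enterprise's Active-Active databases are built on CRDTs, as are collaborative-editing libraries such as Yjs and Automerge. (Google Docs, often cited in this context, is based on Operational Transformation rather than CRDTs.) When soft state is unavoidable, CRDTs make it predictable.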
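
Convergence regardless of order follows from three properties of a state-based merge: it is commutative, associative, and idempotent. The sketch below checks them for the per-node-maximum merge used by the G-Counter, on plain objects rather than the class shown earlier; `Counts`, `merge`, and `same` are hypothetical helper names, and three sample states are a spot check, not a proof.

```typescript
type Counts = Record<string, number>;

// State-based G-Counter merge: per-node maximum.
function merge(a: Counts, b: Counts): Counts {
  const out: Counts = { ...a };
  for (const node of Object.keys(b)) {
    out[node] = Math.max(out[node] ?? 0, b[node]);
  }
  return out;
}

function total(c: Counts): number {
  return Object.keys(c).reduce((sum, node) => sum + c[node], 0);
}

// Equality for counter states; a missing node counts as zero.
function same(a: Counts, b: Counts): boolean {
  const nodes = Object.keys(a).concat(Object.keys(b));
  return nodes.every((node) => (a[node] ?? 0) === (b[node] ?? 0));
}

const x: Counts = { a: 2 };
const y: Counts = { b: 1 };
const z: Counts = { a: 1, c: 4 };

const commutative = same(merge(x, y), merge(y, x));
const associative = same(merge(merge(x, y), z), merge(x, merge(y, z)));
const idempotent = same(merge(x, x), x);
const converged = total(merge(merge(x, y), z));

console.log({ commutative, associative, idempotent, converged });
```

Because of these properties, replicas can exchange state in any order, any number of times, and still arrive at the same counts.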

Monitoring and Debugging Soft State

Debugging issues in soft-state systems is notoriously difficult. The bug you're trying to reproduce might depend on timing, replication lag, or conflict resolution orders that are hard to recreate. Here are strategies for monitoring and debugging soft state:

Monitoring Strategies

•Replication lag metrics — Track delay between primary writes and replica visibility. Alert on excessive lag.
•Conflict rate monitoring — Track how often conflicts occur and how they're resolved. High rates may indicate design issues.
•Version/timestamp skew — Monitor clock synchronization between nodes. Large skew breaks LWW strategies.
•Consistency checks — Periodically compare replicas and log divergence. Background anti-entropy processes do this.
•Operation logging — Log writes with full context (timestamp, node, version) for debugging.
•Distributed tracing — Follow a request through multiple services to understand propagation.

// Monitor replication lag across a distributed database cluster
 
interface ReplicationMetrics {
  primaryWriteTime: Date;
  replicaVisibleTime: Date;
  lagMs: number;
  replicaId: string;
}
 
class ReplicationMonitor {
  private metricsStore: MetricsStore;
  
  async measureReplicationLag(): Promise<ReplicationMetrics[]> {
    const canaryKey = `_canary:${Date.now()}`;
    const writeTime = new Date();
    
    // Write to primary
    await this.primaryDb.write(canaryKey, writeTime.toISOString());
    
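    // Poll replicas in parallel; polling them one after another would add each
    // replica's wait to the measured lag of every replica checked after it
    const metrics: ReplicationMetrics[] = await Promise.all(
      this.replicas.map(async (replica: Replica) => {
        const result = await this.pollForVisibility(replica, canaryKey, 30000);
        return {
          primaryWriteTime: writeTime,
          replicaVisibleTime: result.visibleAt,
          lagMs: result.visibleAt.getTime() - writeTime.getTime(),
          replicaId: replica.id
        };
      })
    );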
    
    // Record metrics
    for (const metric of metrics) {
      await this.metricsStore.recordGauge(
        'replication_lag_ms',
        metric.lagMs,
        { replica: metric.replicaId }
      );
      
      // Alert on excessive lag (soft state becoming "too soft")
      if (metric.lagMs > 5000) {
        this.alerting.warn(`Replication lag on ${metric.replicaId}: ${metric.lagMs}ms`);
      }
    }
    
    // Cleanup canary
    await this.primaryDb.delete(canaryKey);
    
    return metrics;
  }
  
  private async pollForVisibility(
    replica: Replica, 
    key: string, 
    timeoutMs: number
  ): Promise<{ visibleAt: Date }> {
    const startTime = Date.now();
    
    while (Date.now() - startTime < timeoutMs) {
      const result = await replica.read(key);
      if (result !== null) {
        return { visibleAt: new Date() };
      }
      await sleep(100); // Poll every 100ms
    }
    
    throw new Error(`Replication timeout on replica ${replica.id}`);
  }
}
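
Individual lag readings are noisy, so it can help to summarize a window of samples before alerting. The following is a sketch under assumptions: the samples are lag values in milliseconds collected by a monitor like the one above, percentiles use the nearest-rank method, and `summarizeLag` and the 5000 ms threshold are illustrative choices rather than recommendations.

```typescript
// Nearest-rank percentile over an ascending, non-empty array.
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

type LagSummary = { p50: number; p99: number; max: number; breaches: number };

function summarizeLag(lagsMs: number[], thresholdMs: number): LagSummary {
  const sorted = lagsMs.slice().sort((a, b) => a - b);
  return {
    p50: percentile(sorted, 50),
    p99: percentile(sorted, 99),
    max: sorted[sorted.length - 1],
    // Number of samples above the alerting threshold.
    breaches: sorted.filter((lag) => lag > thresholdMs).length,
  };
}

const samples = [12, 40, 8, 15, 5200, 22, 9, 31, 18, 14];
const summary = summarizeLag(samples, 5000);

console.log(summary);
```

Here the median is low while the tail shows one sample above the threshold, which points to an occasional stall rather than a uniformly slow replica.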

Summary: Embracing Soft State

We've explored the second pillar of the BASE consistency model: Soft State. Let's consolidate the key takeaways:

Key Takeaways

•Soft state means data changes without explicit modification — Replication, conflict resolution, TTLs, and anti-entropy processes all cause observable state to change.
•Soft state is a consequence of availability — To achieve basic availability, we accept that different replicas may temporarily have different views.
•Applications must be designed for soft state — Assumptions about data permanence will break. Use versioning, optimistic concurrency, and idempotent operations.
•TTL expiration is soft state by design — Embrace TTLs for automatic cleanup of temporary data, sessions, caches, and rate limits.
•Conflict resolution creates soft state — When writes conflict, the resolved value might differ from any individual write. CRDTs provide elegant solutions.
•Monitoring is essential — Track replication lag, conflict rates, and version skew to understand how 'soft' your state is.

What's Next:

With basic availability and soft state understood, we're ready to explore the third and final pillar of BASE: Eventually Consistent. Eventual consistency describes the convergence guarantee—that given enough time without new writes, all replicas will eventually hold the same value. This is the promise that makes soft state manageable: things may be temporarily inconsistent, but they will converge.

You now understand what soft state means in distributed systems and how it manifests through replication lag, conflict resolution, TTL expiration, and background processes. Designing for soft state means accepting that data is in flux and building applications that handle this gracefully. Next, we'll explore eventual consistency—the guarantee that grounds soft state in predictability.

2 / 5

Loading learning content...

System Design (HLD)BASE Properties

BASE Properties: The Alternative to ACID

LevelIntermediate

Duration75 mins

TopicBASE Properties

2 / 5

Soft State

The Illusion of Permanence

This isn't a bug. It's a feature. And understanding soft state is essential for designing systems that scale to planetary proportions.

What You Will Learn

Defining Soft State

To understand soft state, contrast it with hard state:

Hard State vs. Soft State
Characteristic	Hard State (ACID)	Soft State (BASE)
Durability	Once written, data persists unchanged	Data may be updated by background processes
Consistency	Immediately consistent across all replicas	Temporarily inconsistent across replicas
Predictability	Read always returns exactly what was last written	Read may return different values at different times
Time dependency	State is independent of when you read it	State depends on when you read it
User control	Only user actions modify data	System processes may modify visible data

Soft Doesn't Mean Unreliable

Why State Becomes 'Soft':

Soft state is a direct consequence of two design decisions that enable basic availability:

Asynchronous Replication: Updates to one replica don't immediately appear on other replicas. The system propagates changes over time.
Conflict Resolution: When the same data is modified on multiple replicas (e.g., during a network partition), the system must resolve these conflicts, potentially changing the 'final' value from what any single write requested.

These mechanisms mean that when you read data, you might see:

The value before a recent write (replication delay)
A different value than another reader seeing the same data from a different replica
A merged/resolved value that differs from any individual write

The state is 'soft' because it's in flux—constantly moving toward consistency but never guaranteed to be fully consistent at any given moment.

Mechanisms That Create Soft State

Mechanisms Creating Soft State

•Replication Lag — The time between a write completing on one replica and becoming visible on other replicas. Can range from milliseconds to seconds (or longer during failures).
•Anti-Entropy Processes — Background processes that compare replicas and propagate differences. These run periodically, causing data to change between runs.
•Read Repair — When a read detects inconsistency between replicas, it triggers an update. A read can cause a write!
•Merkle Trees and Hash Comparisons — Efficient mechanisms for detecting differences between replicas, triggering synchronization that changes observable state.
•Conflict Resolution — When concurrent writes conflict, resolution algorithms (LWW, vector clocks, CRDTs) determine the final value, potentially changing data from any individual write.
•TTL and Expiration — Data with time-to-live settings automatically changes (disappears) when TTL expires, without user action.
•Compaction and Garbage Collection — Background processes that clean up data, potentially changing what's observable.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
// Demonstration of soft state through replication lag
 
interface Replica {
  id: string;
  write(key: string, value: any, timestamp: number): Promise<void>;
  read(key: string): Promise<{ value: any; timestamp: number } | null>;
}
 
class SoftStateDemo {
  private replicas: Replica[];
  
  constructor(replicas: Replica[]) {
    this.replicas = replicas;
  }
  
  async demonstrateSoftState() {
    const key = 'user:123:email';
    
    // Write to primary replica
    await this.replicas[0].write(key, 'new@email.com', Date.now());
    console.log('Write completed on primary replica');
    
    // Immediately read from all replicas
    console.log('\nReading immediately after write:');
    for (const replica of this.replicas) {
      const result = await replica.read(key);
      console.log(`  Replica ${replica.id}: ${result?.value}`);
    }
    // Output might be:
    // Replica primary: new@email.com
    // Replica secondary-1: old@email.com    <-- SOFT STATE!
    // Replica secondary-2: old@email.com    <-- SOFT STATE!
    
    // Wait for replication
    await sleep(1000);
    
    // Read again - state has "changed" without new writes
    console.log('\nReading after replication delay:');
    for (const replica of this.replicas) {
      const result = await replica.read(key);
      console.log(`  Replica ${replica.id}: ${result?.value}`);
    }
    // Output:
    // Replica primary: new@email.com
    // Replica secondary-1: new@email.com    <-- State changed!
    // Replica secondary-2: new@email.com    <-- State changed!
  }
}
 
// The observable state changed between reads,
// even though no new writes occurred.
// This is soft state in action.

Read Repair: Reads That Write

This means:

You issue a read-only query
The system detects that replicas disagree
The system writes to some replicas to make them consistent
Your read returns

From the perspective of other readers, data 'changed' as a side effect of your read. This is soft state in action.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
// Read repair in a quorum-based system
 
async function readWithRepair(
  key: string,
  replicas: Replica[],
  readQuorum: number
): Promise<any> {
  // Read from all replicas in parallel
  const responses = await Promise.all(
    replicas.map(async (replica) => {
      try {
        const result = await replica.read(key);
        return { replica, result, success: true };
      } catch (error) {
        return { replica, result: null, success: false };
      }
    })
  );
  
  // Replicas that answered, even if they do not hold the key yet
  const responded = responses.filter(r => r.success);
  
  if (responded.length < readQuorum) {
    throw new Error('Read quorum not achieved');
  }
  
  const withValue = responded.filter(r => r.result);
  if (withValue.length === 0) {
    return undefined; // No replica that answered has this key
  }
  
  // Find the most recent value (highest timestamp wins)
  const latest = withValue.reduce((a, b) => 
    (a.result.timestamp > b.result.timestamp) ? a : b
  );
  
  // Identify replicas with stale or missing data
  const staleReplicas = responded.filter(
    r => !r.result || r.result.timestamp < latest.result.timestamp
  );
  
  // Trigger repair in background (don't await)
  if (staleReplicas.length > 0) {
    triggerReadRepair(key, latest.result, staleReplicas);
    console.log(`Read repair triggered for ${staleReplicas.length} replicas`);
    // Note: This read operation just caused writes!
    // Other readers will see "changed" data as a result.
  }
  
  return latest.result.value;
}
 
async function triggerReadRepair(
  key: string, 
  correctValue: { value: any; timestamp: number },
  staleReplicas: { replica: Replica }[]
) {
  // Background repair - updates stale replicas
  for (const { replica } of staleReplicas) {
    replica.write(key, correctValue.value, correctValue.timestamp)
      .catch(err => console.error(`Repair failed for ${replica.id}`, err));
  }
}

Implications for Application Design

Design Principles for Soft State

• Never cache derived state indefinitely: Data fetched and processed might become stale. Implement TTL-based expiration for all caches.
• Design for idempotency: Operations may be retried due to uncertainty about state. Ensure repeated operations have the same effect as a single operation.
• Use version numbers or timestamps: Track data freshness explicitly. Let clients know how old data is and decide if it's acceptable.
• Implement optimistic concurrency: Check that data hasn't changed before applying updates. Use version vectors or timestamps to detect conflicts.
• Design for conflict resolution: Don't assume writes will 'just work.' Plan for conflicts and implement resolution strategies.
• Embrace eventual consistency in UI: Show users that data may be updating. Use loading states, timestamps, and refresh mechanisms.
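
The idempotency principle above can be sketched with an idempotency key: the client attaches the same key to every retry of a request, and the server returns the stored result instead of repeating the side effect. This is a minimal in-memory illustration, not a production design; `IdempotentProcessor`, `ChargeResult`, and the key format are hypothetical names introduced here.

```typescript
// Hypothetical in-memory idempotency store. A real system would keep this
// in a shared database or cache, usually with a TTL on each entry.
type ChargeResult = { orderId: string; charged: number };

class IdempotentProcessor {
  private completed = new Map<string, ChargeResult>();
  private nextOrder = 1;

  // Applying the same request twice returns the first result
  // instead of charging again.
  charge(idempotencyKey: string, amount: number): ChargeResult {
    const previous = this.completed.get(idempotencyKey);
    if (previous !== undefined) {
      return previous; // Retry detected: no new side effect
    }
    const result: ChargeResult = {
      orderId: `order-${this.nextOrder++}`,
      charged: amount
    };
    this.completed.set(idempotencyKey, result);
    return result;
  }

  totalOrders(): number {
    return this.completed.size;
  }
}

const processor = new IdempotentProcessor();
const first = processor.charge('req-123', 50);
const retry = processor.charge('req-123', 50); // Client retried after a timeout

console.log(first.orderId === retry.orderId); // true
console.log(processor.totalOrders());         // 1
```

Because the retry maps to the stored result, a client that is unsure whether its first attempt succeeded can resend safely.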

The Stale Data Trap

Anti-Pattern: Read-Modify-Write Without Versioning

Consider this common pattern that breaks with soft state:

1. Read current inventory: 100 units
2. User adds 10 units to cart
3. Calculate new inventory: 100 - 10 = 90
4. Write new inventory: 90

The problem: between steps 1 and 4, another process might have updated inventory. You might overwrite their change, or they might overwrite yours.

Correct Pattern: Conditional Update with Versioning

1. Read current inventory: 100 units, version: 5
2. User adds 10 units to cart
3. Calculate new inventory: 100 - 10 = 90
4. Write new inventory: 90, ONLY IF version still 5
5. If version changed, re-read and retry

This pattern respects soft state by acknowledging that the data might have changed and handling that case explicitly.

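// Handle soft state with optimistic concurrency control
// Assumes `db`, `VersionConflictError`, `InventoryRecord`, and
// `InsufficientInventoryError` are defined elsewhere in the application.
 
const sleep = (ms: number) =>
  new Promise<void>(resolve => setTimeout(resolve, ms));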
 
interface VersionedData<T> {
  value: T;
  version: number;
  lastModified: Date;
}
 
async function updateWithOptimisticLocking<T>(
  key: string,
  updateFn: (current: T) => T,
  maxRetries: number = 3
): Promise<VersionedData<T>> {
  
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    // Read current state with version
    const current = await db.readWithVersion<T>(key);
    
    // Apply update function
    const newValue = updateFn(current.value);
    
    try {
      // Attempt conditional write
      const result = await db.writeIfVersion(
        key, 
        newValue, 
        current.version
      );
      
      return result; // Success!
      
    } catch (error) {
      if (error instanceof VersionConflictError) {
        // State changed between read and write (soft state!)
        console.log(`Version conflict on attempt ${attempt + 1} of ${maxRetries}`);
        
        // Back off exponentially before the next attempt, if there is one
        if (attempt < maxRetries - 1) {
          await sleep(Math.pow(2, attempt) * 100);
        }
        continue;
      }
      throw error; // Unexpected error
    }
  }
  
  throw new Error(`Failed to update ${key} after ${maxRetries} attempts`);
}
 
// Usage: Safely decrement inventory
async function reserveInventory(productId: string, quantity: number) {
  return await updateWithOptimisticLocking<InventoryRecord>(
    `inventory:${productId}`,
    (current) => {
      if (current.available < quantity) {
        throw new InsufficientInventoryError();
      }
      return {
        ...current,
        available: current.available - quantity,
        reserved: current.reserved + quantity
      };
    }
  );
}
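
The code above relies on a store that supports a conditional write. As a rough illustration of what that contract might look like, here is an in-memory stand-in; `InMemoryVersionedStore`, its `put` helper, and this `VersionConflictError` class are assumptions made for the sketch, not the API of any particular database.

```typescript
// Hypothetical error type raised when the expected version no longer matches.
class VersionConflictError extends Error {
  constructor(key: string, expected: number, actual: number) {
    super(`Version conflict on ${key}: expected ${expected}, found ${actual}`);
    this.name = 'VersionConflictError';
    // Keeps `instanceof` working when compiled to older JavaScript targets
    Object.setPrototypeOf(this, VersionConflictError.prototype);
  }
}

interface Versioned<T> {
  value: T;
  version: number;
  lastModified: Date;
}

// Hypothetical in-memory store with compare-and-set semantics.
class InMemoryVersionedStore {
  private rows = new Map<string, Versioned<unknown>>();

  // Unconditional write, used here only to seed data
  put<T>(key: string, value: T): void {
    this.rows.set(key, { value, version: 1, lastModified: new Date() });
  }

  readWithVersion<T>(key: string): Versioned<T> {
    const row = this.rows.get(key);
    if (row === undefined) {
      throw new Error(`Key not found: ${key}`);
    }
    return row as Versioned<T>;
  }

  // Write only if the stored version still equals expectedVersion
  writeIfVersion<T>(key: string, value: T, expectedVersion: number): Versioned<T> {
    const current = this.rows.get(key);
    const actual = current === undefined ? 0 : current.version;
    if (actual !== expectedVersion) {
      throw new VersionConflictError(key, expectedVersion, actual);
    }
    const next: Versioned<T> = {
      value,
      version: expectedVersion + 1,
      lastModified: new Date()
    };
    this.rows.set(key, next);
    return next;
  }
}

// Two writers read the same version; only the first write is accepted.
const store = new InMemoryVersionedStore();
store.put('inventory:widget', { available: 100 });

const readA = store.readWithVersion<{ available: number }>('inventory:widget');
const readB = store.readWithVersion<{ available: number }>('inventory:widget');

store.writeIfVersion('inventory:widget', { available: readA.value.available - 10 }, readA.version);

let conflictDetected = false;
try {
  store.writeIfVersion('inventory:widget', { available: readB.value.available - 5 }, readB.version);
} catch (error) {
  conflictDetected = error instanceof VersionConflictError;
}

const finalAvailable =
  store.readWithVersion<{ available: number }>('inventory:widget').value.available;
console.log(conflictDetected); // true
console.log(finalAvailable);   // 90
```

The second writer's update is rejected rather than silently overwriting the first, which is exactly the case the retry loop is designed to handle.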

TTL and Expiring State

Time-to-live (TTL) values are used extensively in distributed systems for:

• Session management: User sessions expire after inactivity
• Caching: Cached data expires to force refresh
• Rate limiting: Rate limit counters reset after time windows
• Distributed locks: Locks expire to prevent deadlocks
• Temporary data: One-time codes, verification tokens

Common TTL Patterns
Use Case	Typical TTL	Why It Works
Session tokens	15-30 minutes	Balance security (short) with UX (long enough)
API response cache	5-60 seconds	Reduce load while keeping data reasonably fresh
CDN cache	1 hour - 1 day	Edge cache benefits outweigh staleness cost
Rate limit windows	1 second - 1 hour	Match rate limit policy granularity
Distributed locks	30-60 seconds	Long enough for operation, short enough to recover
DNS cache	5 minutes - 24 hours	Reduce DNS lookups while allowing updates

// Redis-style operations with TTL
 
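class DistributedCache {
  // `RedisClient` is assumed to expose get, setex, and expire
  // (ioredis, for example, provides these methods)
  constructor(private readonly redis: RedisClient) {}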
  
  // Set with TTL - data will auto-expire (soft state!)
  async setWithTTL(
    key: string, 
    value: any, 
    ttlSeconds: number
  ): Promise<void> {
    await this.redis.setex(key, ttlSeconds, JSON.stringify(value));
  }
  
  // Get with TTL check - handle expiration gracefully
  async getWithFallback<T>(
    key: string,
    fallbackFn: () => Promise<T>,
    ttlSeconds: number
  ): Promise<T> {
    const cached = await this.redis.get(key);
    
    if (cached !== null) {
      return JSON.parse(cached) as T;
    }
    
    // Cache miss or expired - soft state changed!
    // Fetch fresh data and cache it
    const fresh = await fallbackFn();
    await this.setWithTTL(key, fresh, ttlSeconds);
    return fresh;
  }
  
  // Sliding window pattern - extend TTL on access
  async getWithSlidingExpiry<T>(
    key: string,
    ttlSeconds: number
  ): Promise<T | null> {
    const value = await this.redis.get(key);
    
    if (value !== null) {
      // Reset TTL on every access - keeps active data alive.
      // GET and EXPIRE are separate commands, so the key can still
      // expire between the two calls.
      await this.redis.expire(key, ttlSeconds);
      return JSON.parse(value) as T;
    }
    
    return null;
  }
}
 
// Example: Session management with soft state
class SessionManager {
  private readonly SESSION_TTL = 30 * 60; // 30 minutes, in seconds
  
  constructor(private readonly cache: DistributedCache) {}
  
  async getSession(sessionId: string): Promise<Session | null> {
    // Session might expire (change to null) at any moment
    // This is soft state - the session "changes" when TTL expires
    const session = await this.cache.getWithSlidingExpiry<Session>(
      `session:${sessionId}`,
      this.SESSION_TTL
    );
    
    if (!session) {
      // Session expired or never existed
      // Application must handle this gracefully
      return null;
    }
    
    return session;
  }
}
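
Rate limit counters are another case where state resets purely because time passes. The following is a small in-memory sketch of a fixed-window limiter; `FixedWindowRateLimiter` and the injected `now` clock are hypothetical, and the clock is injected only so the example behaves deterministically.

```typescript
// Hypothetical fixed-window rate limiter. Each client's counter carries an
// expiry time, which plays the role of a TTL.
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { count: number; expiresAt: number }>();
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(limit: number, windowMs: number, now: () => number = () => Date.now()) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  allow(clientId: string): boolean {
    const t = this.now();
    let entry = this.windows.get(clientId);

    // Missing or expired counter: start a fresh window
    if (entry === undefined || t >= entry.expiresAt) {
      entry = { count: 0, expiresAt: t + this.windowMs };
      this.windows.set(clientId, entry);
    }

    if (entry.count >= this.limit) {
      return false; // Limit reached for this window
    }
    entry.count += 1;
    return true;
  }
}

let fakeTime = 0;
const limiter = new FixedWindowRateLimiter(2, 1000, () => fakeTime);

const r1 = limiter.allow('client-1'); // true
const r2 = limiter.allow('client-1'); // true
const r3 = limiter.allow('client-1'); // false: limit reached
fakeTime = 1000;                      // The window elapses; nobody wrote anything
const r4 = limiter.allow('client-1'); // true again

console.log(r1, r2, r3, r4); // true true false true
```

The answer for `client-1` flips from rejected to allowed with no intervening write, only the passage of time.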

TTL as a Feature, Not a Bug

Conflict Resolution and Merging

Common Conflict Resolution Strategies

• Last-Writer-Wins (LWW): The write with the latest timestamp wins. Simple, but the losing concurrent write is discarded silently.
• First-Writer-Wins: The first write is preserved; later writes to the same key are ignored.
• Application-Level Resolution: Conflicts are stored, and application logic decides how to merge them.
• Conflict-Free Replicated Data Types (CRDTs): Specially designed data structures that merge automatically without conflicts.
• Operational Transformation (OT): Transform concurrent operations so that replicas converge regardless of the order in which the operations arrive.
• Vector Clocks: Track causality so that concurrent writes are detected instead of silently overwritten. They identify conflicts; resolving them still requires one of the strategies above.
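
To make the vector clock entry concrete, here is a small sketch of how two clocks might be compared to tell an ordered pair of writes from a concurrent pair. The `VectorClock` representation, `tick`, and `compareClocks` are illustrative names chosen here, not taken from a specific database.

```typescript
// A vector clock maps each node ID to the number of events seen from it.
type VectorClock = Record<string, number>;
type Ordering = 'before' | 'after' | 'equal' | 'concurrent';

// Returns a new clock with nodeId's counter advanced by one
function tick(clock: VectorClock, nodeId: string): VectorClock {
  return { ...clock, [nodeId]: (clock[nodeId] || 0) + 1 };
}

// 'before' means a happened before b; 'concurrent' means neither saw the other
function compareClocks(a: VectorClock, b: VectorClock): Ordering {
  let aAhead = false;
  let bAhead = false;
  const nodes = Object.keys(a).concat(Object.keys(b));

  for (const node of nodes) {
    const av = a[node] || 0;
    const bv = b[node] || 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }

  if (aAhead && bAhead) return 'concurrent';
  if (aAhead) return 'after';
  if (bAhead) return 'before';
  return 'equal';
}

const base: VectorClock = { 'replica-1': 1 };
const writeA = tick(base, 'replica-1'); // Accepted by replica-1
const writeB = tick(base, 'replica-2'); // Accepted by replica-2, without seeing writeA

const baseVsA = compareClocks(base, writeA);
const aVsB = compareClocks(writeA, writeB);
console.log(baseVsA); // before
console.log(aVsB);    // concurrent
```

A `concurrent` result is the signal that a real conflict exists and that a resolution strategy has to decide the outcome.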

Deep Dive: Last-Writer-Wins (LWW)

LWW is the simplest conflict resolution strategy and one of the most widely used. Each write includes a timestamp, and when replicas sync, the write with the highest timestamp wins.

Advantages:

• Simple to implement and understand
• Automatically converges to a single value
• No conflict storage or resolution logic needed

Disadvantages:

• Can silently lose data
• Depends on synchronized clocks, which distributed systems cannot guarantee; with clock skew, an older write can win
• Doesn't capture user intent

Example Scenario:

Time T1: User A updates product name to "Widget Pro"
Time T2: User B updates product name to "Widget Plus" 
         (on a different replica, didn't see A's change)
Time T3: Replicas sync - "Widget Plus" wins (later timestamp)

Result: User A's change is silently lost.
User A sees "Widget Plus" and is confused.

This is soft state in action—User A wrote "Widget Pro" and later sees "Widget Plus" without having made any change.
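
The scenario above can be reproduced with a minimal LWW register. This is a sketch under stated assumptions: `LwwRegister` and `mergeLww` are hypothetical names, the timestamps 1 and 2 stand in for T1 and T2, and the tie-break on node ID is an addition so that replicas agree when timestamps are equal.

```typescript
interface LwwRegister<T> {
  value: T;
  timestamp: number;
  nodeId: string;
}

// Keep the write with the higher timestamp. Ties are broken by node ID
// (an assumption of this sketch) so every replica picks the same winner.
function mergeLww<T>(a: LwwRegister<T>, b: LwwRegister<T>): LwwRegister<T> {
  if (a.timestamp !== b.timestamp) {
    return a.timestamp > b.timestamp ? a : b;
  }
  return a.nodeId > b.nodeId ? a : b;
}

const fromUserA: LwwRegister<string> = {
  value: 'Widget Pro', timestamp: 1, nodeId: 'replica-1'
};
const fromUserB: LwwRegister<string> = {
  value: 'Widget Plus', timestamp: 2, nodeId: 'replica-2'
};

// Each replica merges in a different order but reaches the same result
const onReplica1 = mergeLww(fromUserA, fromUserB);
const onReplica2 = mergeLww(fromUserB, fromUserA);

console.log(onReplica1.value); // Widget Plus
console.log(onReplica2.value); // Widget Plus
```

Both replicas converge, and "Widget Pro" is gone without any error being raised, which is the silent data loss listed among the disadvantages.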

// G-Counter CRDT: A counter that only grows
// Multiple replicas can increment independently
// Always merges correctly without conflicts
 
class GCounter {
  private counts: Map<string, number>;
  private nodeId: string;
  
  constructor(nodeId: string) {
    this.nodeId = nodeId;
    this.counts = new Map();
  }
  
  // Increment only affects local node's count
  increment(): void {
    const current = this.counts.get(this.nodeId) || 0;
    this.counts.set(this.nodeId, current + 1);
  }
  
  // Get current value (sum of all nodes)
  value(): number {
    let total = 0;
    for (const count of this.counts.values()) {
      total += count;
    }
    return total;
  }
  
  // Merge with another replica - ALWAYS succeeds!
  merge(other: GCounter): void {
    for (const [nodeId, count] of other.counts.entries()) {
      const myCount = this.counts.get(nodeId) || 0;
      // Take maximum - this is mathematically guaranteed to converge
      this.counts.set(nodeId, Math.max(myCount, count));
    }
  }
  
  // Export state for replication
  toState(): Map<string, number> {
    return new Map(this.counts);
  }
}
 
// Usage example: Page view counter
// Works correctly even with network partitions
 
const nodeA = new GCounter('node-a');
const nodeB = new GCounter('node-b');
 
// Both nodes receive page views independently
nodeA.increment(); // View on node A
nodeA.increment(); // View on node A
nodeB.increment(); // View on node B
 
console.log('Before merge:');
console.log(`  Node A sees: ${nodeA.value()}`); // 2
console.log(`  Node B sees: ${nodeB.value()}`); // 1
 
// Network partition heals, nodes sync
nodeA.merge(nodeB);
nodeB.merge(nodeA);
 
console.log('After merge:');
console.log(`  Node A sees: ${nodeA.value()}`); // 3
console.log(`  Node B sees: ${nodeB.value()}`); // 3
 
// Perfect convergence! No conflicts, no data loss.
// This is soft state that "changes" but always correctly.

CRDTs: The Gold Standard for Soft State

Monitoring and Debugging Soft State

Monitoring Strategies

• Replication lag metrics: Track delay between primary writes and replica visibility. Alert on excessive lag.
• Conflict rate monitoring: Track how often conflicts occur and how they're resolved. High rates may indicate design issues.
• Version/timestamp skew: Monitor clock synchronization between nodes. Large skew breaks LWW strategies.
• Consistency checks: Periodically compare replicas and log divergence. Background anti-entropy processes do this.
• Operation logging: Log writes with full context (timestamp, node, version) for debugging.
• Distributed tracing: Follow a request through multiple services to understand propagation.
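
As an illustration of the consistency check idea, the sketch below compares snapshots from several replicas and reports the keys whose values differ. `findDivergence` and the `Snapshot` shape are hypothetical; a real anti-entropy process would typically compare hashes or Merkle trees instead of full values.

```typescript
// Each replica's data is modeled as a simple key/value snapshot.
type Snapshot = Map<string, string>;

interface Divergence {
  key: string;
  values: Record<string, string | null>; // replica ID -> value (null if missing)
}

function findDivergence(replicas: Record<string, Snapshot>): Divergence[] {
  const replicaIds = Object.keys(replicas);

  // Collect every key that appears on at least one replica
  const allKeys = new Set<string>();
  for (const id of replicaIds) {
    replicas[id].forEach((_value, key) => allKeys.add(key));
  }

  const report: Divergence[] = [];
  allKeys.forEach((key) => {
    const values: Record<string, string | null> = {};
    const distinct: (string | null)[] = [];

    for (const id of replicaIds) {
      const found = replicas[id].get(key);
      const value = found === undefined ? null : found;
      values[id] = value;
      if (distinct.indexOf(value) === -1) distinct.push(value);
    }

    // More than one distinct value means the replicas disagree on this key
    if (distinct.length > 1) {
      report.push({ key, values });
    }
  });

  return report;
}

const report = findDivergence({
  'primary': new Map([['user:1', 'new@email.com'], ['user:2', 'b@email.com']]),
  'secondary-1': new Map([['user:1', 'old@email.com'], ['user:2', 'b@email.com']])
});

console.log(report.length); // 1
console.log(report[0].key); // user:1
```

Logging the size of this report over time gives a rough measure of how far replicas drift apart between repairs.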

// Monitor replication lag across a distributed database cluster
 
interface ReplicationMetrics {
  primaryWriteTime: Date;
  replicaVisibleTime: Date;
  lagMs: number;
  replicaId: string;
}
 
class ReplicationMonitor {
  // `Database` and `Alerting` are assumed collaborator types defined elsewhere
  constructor(
    private readonly primaryDb: Database,
    private readonly replicas: Replica[],
    private readonly metricsStore: MetricsStore,
    private readonly alerting: Alerting
  ) {}
  
  async measureReplicationLag(): Promise<ReplicationMetrics[]> {
    const canaryKey = `_canary:${Date.now()}`;
    const writeTime = new Date();
    
    // Write to primary
    await this.primaryDb.write(canaryKey, writeTime.toISOString());
    
    try {
      // Poll replicas in parallel so that waiting on one replica
      // does not inflate the lag measured for the others
      const metrics: ReplicationMetrics[] = await Promise.all(
        this.replicas.map(async (replica) => {
          const { visibleAt } = await this.pollForVisibility(replica, canaryKey, 30000);
          return {
            primaryWriteTime: writeTime,
            replicaVisibleTime: visibleAt,
            lagMs: visibleAt.getTime() - writeTime.getTime(),
            replicaId: replica.id
          };
        })
      );
      
      // Record metrics
      for (const metric of metrics) {
        await this.metricsStore.recordGauge(
          'replication_lag_ms',
          metric.lagMs,
          { replica: metric.replicaId }
        );
        
        // Alert on excessive lag (soft state becoming "too soft")
        if (metric.lagMs > 5000) {
          this.alerting.warn(`Replication lag on ${metric.replicaId}: ${metric.lagMs}ms`);
        }
      }
      
      return metrics;
    } finally {
      // Clean up the canary even if a replica timed out
      await this.primaryDb.delete(canaryKey);
    }
  }
  
  private async pollForVisibility(
    replica: Replica, 
    key: string, 
    timeoutMs: number
  ): Promise<{ visibleAt: Date }> {
    const startTime = Date.now();
    
    while (Date.now() - startTime < timeoutMs) {
      const result = await replica.read(key);
      if (result !== null) {
        return { visibleAt: new Date() };
      }
      await sleep(100); // Poll every 100ms
    }
    
    throw new Error(`Replication timeout on replica ${replica.id}`);
  }
}

Summary: Embracing Soft State

We've explored the second pillar of the BASE consistency model: Soft State. Let's consolidate the key takeaways:

Key Takeaways

• Soft state means data changes without explicit modification: Replication, conflict resolution, TTLs, and anti-entropy processes all cause observable state to change.
• Soft state is a consequence of availability: To achieve basic availability, we accept that different replicas may temporarily have different views.
• Applications must be designed for soft state: Assumptions about data permanence will break. Use versioning, optimistic concurrency, and idempotent operations.
• TTL expiration is soft state by design: Embrace TTLs for automatic cleanup of temporary data, sessions, caches, and rate limits.
• Conflict resolution creates soft state: When writes conflict, the resolved value might differ from what a given client wrote. CRDTs avoid lost updates for data that can be merged automatically.
• Monitoring is essential: Track replication lag, conflict rates, and version skew to understand how 'soft' your state is.
