In 2016, a heated debate erupted between two giants of distributed systems: Salvatore Sanfilippo (antirez), creator of Redis, and Martin Kleppmann, author of Designing Data-Intensive Applications. At the center of the controversy was Redlock—an algorithm for implementing distributed locks using multiple independent Redis instances.
The exchange became a defining moment in distributed systems discourse. Kleppmann argued that Redlock was fundamentally unsafe for correctness-critical applications, while Sanfilippo defended its design. The debate touched on deep questions about time, consensus, and what it means for a distributed lock to be "safe."
This page examines Redlock in depth: how it works, why it's controversial, and what you should know before using it in production. Understanding Redlock's trade-offs is essential for any engineer making distributed locking decisions.
This page presents both sides of the Redlock debate honestly. The goal is not to declare a winner but to equip you with the knowledge to make informed decisions for your specific use case. For correctness-critical locks, the consensus in the distributed systems community leans toward Kleppmann's critique. For efficiency locks, Redlock may be perfectly acceptable.
Before understanding Redlock, we must understand basic Redis locking. The simplest form uses a single Redis instance with the SET command's conditional options.
The SET NX EX Pattern:
```
# ACQUIRE LOCK
# SET key value NX EX seconds
# - NX: Only set if key does not exist
# - EX: Set expiration time in seconds

SET lock:inventory "client-abc-request-123" NX EX 30

# Returns:
# - "OK" if lock acquired (key was created)
# - nil  if lock NOT acquired (key already exists)


# RELEASE LOCK (safe version using a Lua script)
# Only delete if we are still the holder

EVAL "
  if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
  else
    return 0
  end
" 1 lock:inventory client-abc-request-123

# Returns:
# - 1 if lock released (we were the holder)
# - 0 if lock NOT released (we weren't the holder, or it already expired)
```

Why the Lua Script for Release?
The release operation must be atomic: check if we're the holder AND delete the lock. Without atomicity, a race condition exists:
```
Timeline WITHOUT atomic release:

Client A                      Redis                    Client B
────────                      ─────                    ────────
GET lock:inventory
→ "client-a-123"
(comparing values...)         (lock expires!)
                              (A's TTL expired)        SET lock:inventory
                                                         "client-b-456" NX EX 30
                                                       → OK (acquired)
DEL lock:inventory
→ 1 (DELETED!)                Lock deleted             (thinks it has lock)

Result: A deleted B's lock. B continues unaware it lost the lock.
```

The Lua script prevents this by making check-and-delete atomic.

A single Redis instance has no replication and no consensus. If Redis crashes, all lock state is lost. If a network partition separates some clients from Redis, those clients cannot acquire locks while others still can. Single-instance Redis locks are only appropriate for efficiency locks where occasional double-granting is acceptable.
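To make this concrete, here is a minimal client-side sketch of the single-instance pattern, assuming the redis-py client; the key name and the `do_work` callback are illustrative placeholders, not part of any library API.

```python
import uuid
import redis

# Lua script: delete the key only if we are still the holder (atomic check-and-delete)
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""

def with_single_instance_lock(client: redis.Redis, key: str, ttl_seconds: int, do_work) -> bool:
    token = str(uuid.uuid4())  # unique per acquisition, so we never delete someone else's lock
    # SET key token NX EX ttl -> truthy if the key was created, None if it already exists
    if not client.set(key, token, nx=True, ex=ttl_seconds):
        return False  # someone else holds the lock
    try:
        do_work()
    finally:
        # Atomic compare-and-delete; returns 0 if our lock already expired
        client.eval(RELEASE_SCRIPT, 1, key, token)
    return True

# Example usage (hypothetical host and key):
# r = redis.Redis(host="localhost", port=6379)
# with_single_instance_lock(r, "lock:inventory", 30, lambda: print("critical section"))
```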
Redlock was designed to address single-instance limitations by using N independent Redis masters (typically 5) and requiring a majority quorum for lock acquisition.
Key Insight: If you acquire a lock on a majority of independent servers, and each has its own clock and failure mode, the probability of all failing in a way that violates safety should be low—at least, that's the theory.
The Algorithm:
```
REDLOCK ALGORITHM (N=5 independent Redis instances):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PARAMETERS:
- N = 5 (number of Redis instances, should be odd)
- TTL = Lock validity time (e.g., 10 seconds)
- Quorum = N/2 + 1 = 3 (majority)
- Clock drift factor = 0.01 (1% assumed max clock drift)

ACQUIRE LOCK:
━━━━━━━━━━━━

1. Get current time T1 (in milliseconds)

2. Attempt to acquire lock on ALL N instances sequentially:
   For each Redis instance i (from 1 to 5):
   - SET resource_name random_value NX PX (TTL in ms)
   - Use very short timeout per instance (e.g., 5-50ms)
   - If instance unreachable, move on immediately

3. Get current time T2 after all attempts

4. Calculate elapsed time: elapsed = T2 - T1

5. Calculate validity time:
   validity = TTL - elapsed - (TTL * clock_drift_factor)

   Example: TTL=10000ms, elapsed=50ms, drift=1%
   validity = 10000 - 50 - 100 = 9850ms

6. Check success criteria:
   - Did we acquire on MAJORITY (≥ 3) instances?  AND
   - Is validity > 0?

   If YES: Lock acquired! Use for at most 'validity' milliseconds.
   If NO:  Lock NOT acquired. Go to RELEASE.

RELEASE LOCK:
━━━━━━━━━━━━

1. Attempt to release on ALL N instances (not just the ones we acquired from):
   For each Redis instance i:
   - Run release Lua script with our random_value
   - Ignore failures/timeouts

   Why release on all? We may have acquired on an instance but think we
   didn't, due to network issues when reading the response.
```

Visual Example:
```
5 Independent Redis Instances: R1, R2, R3, R4, R5

Client A attempts lock with TTL=10s:

Time    R1           R2           R3           R4           R5
────    ──           ──           ──           ──           ──
t0      SET OK ✓     SET OK ✓     SET OK ✓     (timeout)    SET OK ✓
        5ms          10ms         8ms          50ms         12ms

Elapsed: 85ms total
Successful instances: 4 (R1, R2, R3, R5) ← Majority achieved!
Validity: 10000 - 85 - 100 = 9815ms

Result: Lock ACQUIRED on 4/5 instances.
Client A may hold lock for up to 9815ms.


Client B attempts same lock (while A holds):

Time    R1           R2           R3           R4           R5
────    ──           ──           ──           ──           ──
t100    SET nil ✗    SET nil ✗    SET nil ✗    SET OK ✓     SET nil ✗
        (A's lock)   (A's lock)   (A's lock)   (was down)   (A's lock)

Successful instances: 1 (only R4)
1 < 3 (majority) → Lock NOT acquired
Client B must retry after delay.
```

Redlock requires truly independent Redis instances—not replicated masters. Redis replication is asynchronous; a write to the master might not reach replicas before failover. If you used a Redis Cluster, a master failure could lose lock state. Independent instances ensure no single point of failure and no replication lag issues.
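As a rough translation of the pseudocode above, the sketch below shows the core acquire/release loop, assuming the redis-py client; host names and timeouts are placeholders, and the fencing-aware implementation later on this page builds on the same structure.

```python
import time
import uuid
import redis

# Same compare-and-delete Lua script used for single-instance locks
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""

def redlock_acquire(instances, resource, ttl_ms=10_000, drift_factor=0.01):
    """Try to acquire `resource` on a majority of `instances`.
    Returns (token, validity_ms) on success, or None on failure."""
    token = str(uuid.uuid4())
    quorum = len(instances) // 2 + 1
    start = time.monotonic() * 1000

    acquired = 0
    for client in instances:
        try:
            # SET resource token NX PX ttl (per-instance timeout set on the client)
            if client.set(resource, token, nx=True, px=ttl_ms):
                acquired += 1
        except redis.RedisError:
            continue  # unreachable instance: move on immediately

    elapsed = time.monotonic() * 1000 - start
    validity = ttl_ms - elapsed - ttl_ms * drift_factor

    if acquired >= quorum and validity > 0:
        return token, validity

    redlock_release(instances, resource, token)  # clean up partial acquisitions
    return None

def redlock_release(instances, resource, token):
    """Release on ALL instances, ignoring failures."""
    for client in instances:
        try:
            client.eval(RELEASE_SCRIPT, 1, resource, token)
        except redis.RedisError:
            continue

# Example usage (hypothetical hosts, 50ms per-instance timeout):
# instances = [redis.Redis(host=h, socket_timeout=0.05) for h in ("r1", "r2", "r3", "r4", "r5")]
# result = redlock_acquire(instances, "lock:inventory")
```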
Redlock's correctness depends critically on bounded clock behavior. This is the heart of Kleppmann's critique.
The Timing Assumption:
Redlock assumes that:
- Clocks on all Redis instances advance at approximately the same rate (drift bounded by the clock_drift_factor)
- Network delays between the client and the Redis instances are small relative to the lock TTL
- Process pauses (GC, scheduling, CPU throttling) are short relative to the lock validity time

If any of these assumptions is violated, safety can be violated.
```
SCENARIO: GC Pause Causes Safety Violation

Timeline:
────────

t0:  Client A acquires lock on R1, R2, R3 (majority)
     TTL = 10 seconds
     Validity = 9.8 seconds

t1:  Client A begins critical section

t2:  Client A enters LONG GC PAUSE (e.g., 13 seconds)
     (A is frozen, time appears stopped from A's perspective)

t10: Lock expires on R1, R2, R3 (TTL reached)
     A is still frozen in GC

t11: Client B acquires lock on R1, R2, R3
     B begins critical section

t15: Client A's GC completes
     A checks: "Do I still have validity time left?"

     A's calculation: t_now (t15) - t_acquired (t0) = 15 seconds
     Wait, that's > TTL... but A might not check properly!

     Or worse: A's local clock was also affected by the pause,
     so A believes only 1 second has passed.

     A proceeds with critical section.

t16: BOTH A AND B ARE IN CRITICAL SECTION
     Mutual exclusion violated.

THE PROBLEM:
Redlock's validity calculation happens at acquire time.
If a pause occurs AFTER acquire but BEFORE checking validity,
the client can believe it still holds a lock that has expired.
```

The fundamental issue is that Redlock assumes synchronous system behavior (bounded delays) but runs on asynchronous infrastructure where GC pauses, network delays, and CPU throttling can cause unbounded delays. These delays can happen at the worst possible time—between checking lock validity and performing the critical operation.
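To see why client-side checks cannot fully close this window, consider the sketch below; the `lock_deadline_ms` bookkeeping and the `write_to_storage` callback are illustrative, not part of the Redlock specification.

```python
import time

def critical_section_with_validity_check(lock_deadline_ms: float, write_to_storage):
    """lock_deadline_ms: acquire_time + validity, computed when the lock was granted."""
    # Client-side guard: refuse to act if we believe the lock has expired.
    if time.monotonic() * 1000 >= lock_deadline_ms:
        raise RuntimeError("lock validity exhausted, aborting")

    # <-- A GC pause, page fault, or CPU stall can happen RIGHT HERE,
    #     after the check passed but before the write is sent.
    #     By the time write_to_storage() reaches the resource, the lock
    #     may have expired and been granted to another client.
    write_to_storage()
```

No amount of client-side checking removes this gap, which is why the fencing-token approach described later on this page pushes the final check into the storage layer itself.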
Why Zookeeper/etcd Don't Have This Problem:
Consensus-based systems solve this differently:

- Each lock grant carries a monotonically increasing fencing token (Zookeeper's zxid, etcd's revision), so the protected resource can reject writes from a holder whose lock has since expired.
- The client's local clock is never the arbiter of safety: even if a paused client wakes up and still believes it holds the lock, its stale token is rejected at the storage layer.
In February 2016, Martin Kleppmann published "How to do distributed locking" which systematically critiqued Redlock. Salvatore Sanfilippo (antirez) responded with a rebuttal. The exchange illuminated fundamental questions about distributed locking.
Kleppmann's Main Arguments:

- Redlock's safety depends on timing assumptions: bounded clock drift, bounded network delay, and bounded process pauses. A clock jump or GC pause at the wrong moment can violate mutual exclusion.
- Redlock provides no fencing tokens, so the protected resource has no way to reject writes from a client whose lock has already expired.
- For efficiency locks, a single Redis instance is enough; for correctness locks, Redlock is not safe enough and a consensus system such as Zookeeper should be used instead.
Sanfilippo's Defense:

- The clock jumps required to break Redlock are large and can be avoided operationally (slewing rather than stepping the clock, and not manually resetting server time).
- A pause between the validity check and the critical operation affects any lock service that lacks fencing, not just Redlock; network delays only shrink the validity window, which is a liveness issue rather than a safety issue.
- Many real systems cannot implement fencing at the resource, and for them Redlock's quorum-based approach is a practical improvement over a single instance.
The distributed systems community largely sided with Kleppmann. The general consensus is: (1) For efficiency locks, Redlock adds complexity over single-instance Redis without clear benefit. (2) For correctness locks, use a consensus-based system like Zookeeper or etcd. Redlock sits in an awkward middle ground.
```
THE CENTRAL DISAGREEMENT:
━━━━━━━━━━━━━━━━━━━━━━━━━━

Kleppmann's Position:
  "A distributed lock algorithm MUST be correct under asynchronous
   network assumptions. Redlock is not, because it relies on
   synchronized clocks and bounded process execution."

Sanfilippo's Position:
  "Real systems are quasi-synchronous. With reasonable bounds on
   clock drift and process pauses, Redlock provides practical safety."

Underlying Philosophical Difference:

  Kleppmann:  Safety guarantees should be mathematical, not probabilistic.
  Sanfilippo: Probabilistic guarantees that work 99.999% of the time
              are sufficient for most use cases.

Who's Right? Depends on your use case:
  - If failure means data corruption: Kleppmann is right.
  - If failure means duplicate work:  Sanfilippo may be right.

  The problem is that engineers often don't know which category
  they're in until disaster strikes.
```

One mitigation for Redlock's timing issues is to combine it with fencing tokens. The idea: even if mutual exclusion fails due to timing, the protected resource can reject operations from stale lock holders.
Implementing Fencing with Redlock:
```python
import time
import redis
from typing import Optional, Tuple


class StaleTokenError(Exception):
    """Raised when a write arrives with an outdated fencing token."""


class RedlockWithFencing:
    def __init__(self, redis_instances: list, ttl_ms: int = 10000):
        self.instances = [redis.Redis(host=h, port=p) for h, p in redis_instances]
        self.ttl_ms = ttl_ms
        self.quorum = len(self.instances) // 2 + 1
        self.clock_drift_factor = 0.01

    def acquire(self, resource: str, client_id: str) -> Optional[Tuple[int, float]]:
        """
        Acquire lock and return (fencing_token, validity_ms) or None.

        The fencing token is the maximum lock version across all instances.
        This provides weak monotonicity - not perfect, but better than nothing.
        """
        start_time = time.time() * 1000
        acquired_count = 0
        max_version = 0

        for instance in self.instances:
            try:
                # Use INCR to generate a version, then SET with that version
                version = instance.incr(f"{resource}:version")
                result = instance.set(
                    resource,
                    f"{client_id}:{version}",
                    nx=True,
                    px=self.ttl_ms
                )
                if result:
                    acquired_count += 1
                    max_version = max(max_version, version)
            except redis.RedisError:
                continue

        elapsed = time.time() * 1000 - start_time
        validity = self.ttl_ms - elapsed - (self.ttl_ms * self.clock_drift_factor)

        if acquired_count >= self.quorum and validity > 0:
            return (max_version, validity)

        # Failed to acquire, release any partial locks
        self.release(resource, client_id)
        return None

    def release(self, resource: str, client_id: str):
        """Release lock on all instances."""
        release_script = """
        local val = redis.call('get', KEYS[1])
        if val and string.find(val, ARGV[1]) then
            return redis.call('del', KEYS[1])
        end
        return 0
        """
        for instance in self.instances:
            try:
                instance.eval(release_script, 1, resource, client_id)
            except redis.RedisError:
                continue


# Protected resource with fencing
class FencedDatabase:
    def __init__(self):
        self.highest_token = 0

    def update(self, fencing_token: int, data: dict) -> bool:
        if fencing_token < self.highest_token:
            raise StaleTokenError(
                f"Token {fencing_token} < current {self.highest_token}"
            )
        self.highest_token = fencing_token
        # Proceed with update
        self._commit(data)
        return True

    def _commit(self, data: dict):
        # Placeholder for the actual storage write
        pass
```

Limitations of Fencing with Redlock:

- The token is only weakly monotonic: each instance keeps its own `resource:version` counter, so counters can diverge, and an instance that restarts without persistence can hand out a lower number than one issued earlier.
- It only helps if the protected resource can check and record tokens; in that case the real safety comes from the resource's check rather than from Redlock, as the critique below points out.
"If you use fencing tokens, Redlock becomes an optimization to reduce the number of times you get rejected by the resource. The actual safety comes from fencing, not from Redlock. At that point, a single Redis instance provides the same optimization with less complexity."
Despite the controversy, Redlock has its place. The key is matching the tool to the requirement.
Decision Framework:
| Use Case | Recommended Solution | Why |
|---|---|---|
| Cache warming coordination | Single Redis SETNX | Simple, low stakes |
| Prevent duplicate batch jobs | Single Redis or Redlock | Duplicate is wasteful but not catastrophic |
| Rate limiting coordination | Redis with Lua scripts | Approximate limits are fine |
| Inventory management | Zookeeper/etcd + fencing | Data integrity critical |
| Payment processing | Zookeeper/etcd + fencing | Financial correctness required |
| Database leader election | Zookeeper/etcd | Single leader mandatory |
| Distributed cron | Zookeeper/etcd or Redlock | Depends on job idempotency |
| File exclusive access | Zookeeper/etcd | Partial writes cause corruption |
For efficiency locks, a single Redis instance may be simpler than Redlock and just as effective. The additional complexity of Redlock (5 instances, quorum logic, validity calculation) provides marginal benefit for cases where occasional double-granting is acceptable anyway. And if you truly need that marginal improvement, you probably need a consensus system instead.
If you decide to use Redlock, use a well-tested client library rather than implementing the algorithm yourself.
Official and Popular Implementations:
```javascript
const Redlock = require('redlock');
const Redis = require('ioredis');

// Connect to 5 independent Redis instances
const redisClients = [
  new Redis({ host: 'redis1', port: 6379 }),
  new Redis({ host: 'redis2', port: 6379 }),
  new Redis({ host: 'redis3', port: 6379 }),
  new Redis({ host: 'redis4', port: 6379 }),
  new Redis({ host: 'redis5', port: 6379 }),
];

const redlock = new Redlock(redisClients, {
  // Time in ms before retrying
  retryDelay: 200,
  // Max attempts to acquire lock
  retryCount: 10,
  // Clock drift factor (0.01 = 1%)
  driftFactor: 0.01,
});

async function doExclusiveWork() {
  try {
    // Acquire lock for 10 seconds
    const lock = await redlock.acquire(['locks:my-resource'], 10000);
    console.log('Lock acquired:', lock.value);

    try {
      // CRITICAL SECTION
      // Use lock.expiration to check remaining time
      console.log('Expiration:', lock.expiration);
      await processExclusiveOperation();
    } finally {
      // Release lock
      await lock.release();
      console.log('Lock released');
    }
  } catch (error) {
    if (error.name === 'LockError') {
      console.log('Could not acquire lock:', error.message);
    } else {
      throw error;
    }
  }
}

// Auto-extending locks (for long operations)
async function doLongWork() {
  const lock = await redlock.acquire(['locks:long-operation'], 10000);

  // Extend lock every 5 seconds
  const extendInterval = setInterval(async () => {
    try {
      await lock.extend(10000);
      console.log('Lock extended');
    } catch (e) {
      console.error('Failed to extend lock!');
      clearInterval(extendInterval);
    }
  }, 5000);

  try {
    await veryLongOperation();
  } finally {
    clearInterval(extendInterval);
    await lock.release();
  }
}
```

Simpler Alternatives to Redlock:
If you're questioning whether Redlock is right for you, consider these alternatives:
| Alternative | When to Use | Trade-off |
|---|---|---|
| Single Redis SETNX | Efficiency locks, can tolerate Redis failure | Single point of failure |
| Redis Sentinel + SETNX | Efficiency locks with failover | Async replication can lose locks |
| PostgreSQL advisory locks | Already using PostgreSQL, single-process critical sections | Database as coordination service |
| etcd via K8s | Running Kubernetes | Adds dependency, but battle-tested |
| AWS DynamoDB locks | AWS environment | Managed service, conditional writes |
| GCP Spanner locks | GCP environment | External consistency, higher latency |
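As an illustration of the PostgreSQL advisory-lock row in the table above, here is a minimal sketch assuming the psycopg2 driver; the DSN and lock key are placeholders. Advisory locks are held by the database session, so they are released automatically if the connection drops.

```python
import psycopg2

LOCK_KEY = 42  # application-chosen 64-bit integer identifying the resource

def run_exclusively(dsn: str, do_work) -> bool:
    """Run do_work() only if we win the advisory lock; return False otherwise."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # Session-level, non-blocking advisory lock: returns true/false immediately
            cur.execute("SELECT pg_try_advisory_lock(%s)", (LOCK_KEY,))
            if not cur.fetchone()[0]:
                return False  # someone else holds the lock
            try:
                do_work()
            finally:
                cur.execute("SELECT pg_advisory_unlock(%s)", (LOCK_KEY,))
        return True
    finally:
        conn.close()

# Example usage (hypothetical DSN):
# run_exclusively("dbname=app user=app", lambda: print("critical section"))
```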
We've explored Redlock in depth—its algorithm, limitations, the famous debate, and practical guidance. Let's consolidate the key insights:
The Bottom Line:
Redlock occupies an uncomfortable middle ground:

- It is more complex to operate than a single Redis instance (five independent masters, quorum logic, validity arithmetic), yet it lacks the asynchronous-model safety guarantees of a consensus system.
- It is heavier than necessary for efficiency locks, and not safe enough for correctness locks.
For most engineers, the recommendation is clear:

- For efficiency locks, use a single Redis instance with SET NX PX and a Lua release script.
- For correctness locks, use Zookeeper or etcd with fencing tokens.
- Reach for Redlock only if you have examined its timing assumptions and are confident they hold in your environment.
You have completed the Distributed Locking module. You now understand why distributed locks are needed, the formal properties they must satisfy, how Zookeeper and etcd implement correct locks, and the trade-offs of the Redis Redlock algorithm. You're equipped to make informed decisions about distributed locking in your systems—choosing the right tool based on whether you need efficiency or correctness, and understanding the failure modes of each approach.
Recommended Reading:

- Martin Kleppmann, "How to do distributed locking" (2016)
- Salvatore Sanfilippo (antirez), "Is Redlock safe?" (2016)
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Chapter 8, "The Trouble with Distributed Systems"