In 2016, a heated debate erupted between two giants of distributed systems: Salvatore Sanfilippo (antirez), creator of Redis, and Martin Kleppmann, author of Designing Data-Intensive Applications. At the center of the controversy was Redlock—an algorithm for implementing distributed locks using multiple independent Redis instances.
The exchange became a defining moment in distributed systems discourse. Kleppmann argued that Redlock was fundamentally unsafe for correctness-critical applications, while Sanfilippo defended its design. The debate touched on deep questions about time, consensus, and what it means for a distributed lock to be "safe."
This page examines Redlock in depth: how it works, why it's controversial, and what you should know before using it in production. Understanding Redlock's trade-offs is essential for any engineer making distributed locking decisions.
This page presents both sides of the Redlock debate honestly. The goal is not to declare a winner but to equip you with the knowledge to make informed decisions for your specific use case. For correctness-critical locks, the consensus in the distributed systems community leans toward Kleppmann's critique. For efficiency locks, Redlock may be perfectly acceptable.
Before understanding Redlock, we must understand basic Redis locking. The simplest form uses a single Redis instance with the SET command's conditional options.
The SET NX EX Pattern:
```
# ACQUIRE LOCK
# SET key value NX EX seconds
# - NX: Only set if key does not exist
# - EX: Set expiration time in seconds

SET lock:inventory "client-abc-request-123" NX EX 30

# Returns:
# - "OK" if lock acquired (key was created)
# - nil  if lock NOT acquired (key already exists)


# RELEASE LOCK (safe version using a Lua script)
# Only delete if we are still the holder

EVAL "
  if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
  else
    return 0
  end
" 1 lock:inventory client-abc-request-123

# Returns:
# - 1 if lock released (we were the holder)
# - 0 if lock NOT released (we weren't the holder, or it already expired)
```

Why the Lua Script for Release?
The release operation must be atomic: check if we're the holder AND delete the lock. Without atomicity, a race condition exists:
```
Timeline WITHOUT atomic release:

Client A                      Redis                    Client B
────────                      ─────                    ────────
GET lock:inventory
→ "client-a-123"
(comparing values...)         (lock expires!)
                              (A's TTL expired)        SET lock:inventory
                                                         "client-b-456" NX EX 30
                                                       → OK (acquired)
DEL lock:inventory
→ 1 (DELETED!)                Lock deleted             (thinks it has lock)

Result: A deleted B's lock. B continues unaware it lost the lock.
```

The Lua script prevents this by making check-and-delete atomic.

A single Redis instance has no replication and no consensus. If Redis crashes, all lock state is lost. If a network partition separates some clients from Redis, those clients cannot acquire locks while others still can. Single-instance Redis locks are only appropriate for efficiency locks where occasional double-granting is acceptable.
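To make this concrete, here is a minimal client-side sketch of the single-instance pattern, assuming the redis-py client; the key name and the `do_work` callback are illustrative placeholders, not part of any library API.

```python
import uuid
import redis

# Lua script: delete the key only if we are still the holder (atomic check-and-delete)
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""

def with_single_instance_lock(client: redis.Redis, key: str, ttl_seconds: int, do_work) -> bool:
    token = str(uuid.uuid4())  # unique per acquisition, so we never delete someone else's lock
    # SET key token NX EX ttl -> truthy if the key was created, None if it already exists
    if not client.set(key, token, nx=True, ex=ttl_seconds):
        return False  # someone else holds the lock
    try:
        do_work()
    finally:
        # Atomic compare-and-delete; returns 0 if our lock already expired
        client.eval(RELEASE_SCRIPT, 1, key, token)
    return True

# Example usage (hypothetical host and key):
# r = redis.Redis(host="localhost", port=6379)
# with_single_instance_lock(r, "lock:inventory", 30, lambda: print("critical section"))
```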
Redlock was designed to address single-instance limitations by using N independent Redis masters (typically 5) and requiring a majority quorum for lock acquisition.
Key Insight: If you acquire a lock on a majority of independent servers, and each has its own clock and failure mode, the probability of all failing in a way that violates safety should be low—at least, that's the theory.
The Algorithm:
```
REDLOCK ALGORITHM (N=5 independent Redis instances):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PARAMETERS:
- N = 5 (number of Redis instances, should be odd)
- TTL = Lock validity time (e.g., 10 seconds)
- Quorum = N/2 + 1 = 3 (majority)
- Clock drift factor = 0.01 (1% assumed max clock drift)

ACQUIRE LOCK:
━━━━━━━━━━━━

1. Get current time T1 (in milliseconds)

2. Attempt to acquire lock on ALL N instances sequentially:
   For each Redis instance i (from 1 to 5):
   - SET resource_name random_value NX PX (TTL in ms)
   - Use very short timeout per instance (e.g., 5-50ms)
   - If instance unreachable, move on immediately

3. Get current time T2 after all attempts

4. Calculate elapsed time: elapsed = T2 - T1

5. Calculate validity time:
   validity = TTL - elapsed - (TTL * clock_drift_factor)

   Example: TTL=10000ms, elapsed=50ms, drift=1%
   validity = 10000 - 50 - 100 = 9850ms

6. Check success criteria:
   - Did we acquire on MAJORITY (≥ 3) instances?  AND
   - Is validity > 0?

   If YES: Lock acquired! Use for at most 'validity' milliseconds.
   If NO:  Lock NOT acquired. Go to RELEASE.

RELEASE LOCK:
━━━━━━━━━━━━

1. Attempt to release on ALL N instances (not just the ones we acquired from):
   For each Redis instance i:
   - Run release Lua script with our random_value
   - Ignore failures/timeouts

   Why release on all? We may have acquired on an instance but think we
   didn't, due to network issues when reading the response.
```

Visual Example:
```
5 Independent Redis Instances: R1, R2, R3, R4, R5

Client A attempts lock with TTL=10s:

Time    R1           R2           R3           R4           R5
────    ──           ──           ──           ──           ──
t0      SET OK ✓     SET OK ✓     SET OK ✓     (timeout)    SET OK ✓
        5ms          10ms         8ms          50ms         12ms

Elapsed: 85ms total
Successful instances: 4 (R1, R2, R3, R5) ← Majority achieved!
Validity: 10000 - 85 - 100 = 9815ms

Result: Lock ACQUIRED on 4/5 instances.
Client A may hold lock for up to 9815ms.


Client B attempts same lock (while A holds):

Time    R1           R2           R3           R4           R5
────    ──           ──           ──           ──           ──
t100    SET nil ✗    SET nil ✗    SET nil ✗    SET OK ✓     SET nil ✗
        (A's lock)   (A's lock)   (A's lock)   (was down)   (A's lock)

Successful instances: 1 (only R4)
1 < 3 (majority) → Lock NOT acquired
Client B must retry after delay.
```

Redlock requires truly independent Redis instances—not replicated masters. Redis replication is asynchronous; a write to the master might not reach replicas before failover. If you used a Redis Cluster, a master failure could lose lock state. Independent instances ensure no single point of failure and no replication lag issues.
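As a rough translation of the pseudocode above, the sketch below shows the core acquire/release loop, assuming the redis-py client; host names and timeouts are placeholders, and the fencing-aware implementation later on this page builds on the same structure.

```python
import time
import uuid
import redis

# Same compare-and-delete Lua script used for single-instance locks
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""

def redlock_acquire(instances, resource, ttl_ms=10_000, drift_factor=0.01):
    """Try to acquire `resource` on a majority of `instances`.
    Returns (token, validity_ms) on success, or None on failure."""
    token = str(uuid.uuid4())
    quorum = len(instances) // 2 + 1
    start = time.monotonic() * 1000

    acquired = 0
    for client in instances:
        try:
            # SET resource token NX PX ttl (per-instance timeout set on the client)
            if client.set(resource, token, nx=True, px=ttl_ms):
                acquired += 1
        except redis.RedisError:
            continue  # unreachable instance: move on immediately

    elapsed = time.monotonic() * 1000 - start
    validity = ttl_ms - elapsed - ttl_ms * drift_factor

    if acquired >= quorum and validity > 0:
        return token, validity

    redlock_release(instances, resource, token)  # clean up partial acquisitions
    return None

def redlock_release(instances, resource, token):
    """Release on ALL instances, ignoring failures."""
    for client in instances:
        try:
            client.eval(RELEASE_SCRIPT, 1, resource, token)
        except redis.RedisError:
            continue

# Example usage (hypothetical hosts, 50ms per-instance timeout):
# instances = [redis.Redis(host=h, socket_timeout=0.05) for h in ("r1", "r2", "r3", "r4", "r5")]
# result = redlock_acquire(instances, "lock:inventory")
```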
Redlock's correctness depends critically on bounded clock behavior. This is the heart of Kleppmann's critique.
The Timing Assumption:
Redlock assumes that:
- Clocks on all Redis instances advance at approximately the same rate (drift bounded by the clock_drift_factor)
- Network delays between the client and the Redis instances are small relative to the lock TTL
- Process pauses (GC, scheduling, CPU throttling) are short relative to the lock validity time

If any of these assumptions is violated, safety can be violated.
```
SCENARIO: GC Pause Causes Safety Violation

Timeline:
────────

t0:  Client A acquires lock on R1, R2, R3 (majority)
     TTL = 10 seconds
     Validity = 9.8 seconds

t1:  Client A begins critical section

t2:  Client A enters LONG GC PAUSE (e.g., 13 seconds)
     (A is frozen, time appears stopped from A's perspective)

t10: Lock expires on R1, R2, R3 (TTL reached)
     A is still frozen in GC

t11: Client B acquires lock on R1, R2, R3
     B begins critical section

t15: Client A's GC completes
     A checks: "Do I still have validity time left?"

     A's calculation: t_now (t15) - t_acquired (t0) = 15 seconds
     Wait, that's > TTL... but A might not check properly!

     Or worse: A's local clock was also affected by the pause,
     so A believes only 1 second has passed.

     A proceeds with critical section.

t16: BOTH A AND B ARE IN CRITICAL SECTION
     Mutual exclusion violated.

THE PROBLEM:
Redlock's validity calculation happens at acquire time.
If a pause occurs AFTER acquire but BEFORE checking validity,
the client can believe it still holds a lock that has expired.
```

The fundamental issue is that Redlock assumes synchronous system behavior (bounded delays) but runs on asynchronous infrastructure where GC pauses, network delays, and CPU throttling can cause unbounded delays. These delays can happen at the worst possible time—between checking lock validity and performing the critical operation.
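To see why client-side checks cannot fully close this window, consider the sketch below; the `lock_deadline_ms` bookkeeping and the `write_to_storage` callback are illustrative, not part of the Redlock specification.

```python
import time

def critical_section_with_validity_check(lock_deadline_ms: float, write_to_storage):
    """lock_deadline_ms: acquire_time + validity, computed when the lock was granted."""
    # Client-side guard: refuse to act if we believe the lock has expired.
    if time.monotonic() * 1000 >= lock_deadline_ms:
        raise RuntimeError("lock validity exhausted, aborting")

    # <-- A GC pause, page fault, or CPU stall can happen RIGHT HERE,
    #     after the check passed but before the write is sent.
    #     By the time write_to_storage() reaches the resource, the lock
    #     may have expired and been granted to another client.
    write_to_storage()
```

No amount of client-side checking removes this gap, which is why the fencing-token approach described later on this page pushes the final check into the storage layer itself.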
Why Zookeeper/etcd Don't Have This Problem:
Consensus-based systems solve this differently:

- Each lock grant carries a monotonically increasing fencing token (Zookeeper's zxid, etcd's revision), so the protected resource can reject writes from a holder whose lock has since expired.
- The client's local clock is never the arbiter of safety: even if a paused client wakes up and still believes it holds the lock, its stale token is rejected at the storage layer.
In February 2016, Martin Kleppmann published "How to do distributed locking" which systematically critiqued Redlock. Salvatore Sanfilippo (antirez) responded with a rebuttal. The exchange illuminated fundamental questions about distributed locking.
Kleppmann's Main Arguments:

- Redlock's safety depends on timing assumptions: bounded clock drift, bounded network delay, and bounded process pauses. A clock jump or GC pause at the wrong moment can violate mutual exclusion.
- Redlock provides no fencing tokens, so the protected resource has no way to reject writes from a client whose lock has already expired.
- For efficiency locks, a single Redis instance is enough; for correctness locks, Redlock is not safe enough and a consensus system such as Zookeeper should be used instead.
Sanfilippo's Defense:

- The clock jumps required to break Redlock are large and can be avoided operationally (slewing rather than stepping the clock, and not manually resetting server time).
- A pause between the validity check and the critical operation affects any lock service that lacks fencing, not just Redlock; network delays only shrink the validity window, which is a liveness issue rather than a safety issue.
- Many real systems cannot implement fencing at the resource, and for them Redlock's quorum-based approach is a practical improvement over a single instance.
The distributed systems community largely sided with Kleppmann. The general consensus is: (1) For efficiency locks, Redlock adds complexity over single-instance Redis without clear benefit. (2) For correctness locks, use a consensus-based system like Zookeeper or etcd. Redlock sits in an awkward middle ground.
```
THE CENTRAL DISAGREEMENT:
━━━━━━━━━━━━━━━━━━━━━━━━━━

Kleppmann's Position:
  "A distributed lock algorithm MUST be correct under asynchronous
   network assumptions. Redlock is not, because it relies on
   synchronized clocks and bounded process execution."

Sanfilippo's Position:
  "Real systems are quasi-synchronous. With reasonable bounds on
   clock drift and process pauses, Redlock provides practical safety."

Underlying Philosophical Difference:

  Kleppmann:  Safety guarantees should be mathematical, not probabilistic.
  Sanfilippo: Probabilistic guarantees that work 99.999% of the time
              are sufficient for most use cases.

Who's Right? Depends on your use case:
  - If failure means data corruption: Kleppmann is right.
  - If failure means duplicate work:  Sanfilippo may be right.

  The problem is that engineers often don't know which category
  they're in until disaster strikes.
```

One mitigation for Redlock's timing issues is to combine it with fencing tokens. The idea: even if mutual exclusion fails due to timing, the protected resource can reject operations from stale lock holders.
Implementing Fencing with Redlock:
```python
import time
import redis
from typing import Optional, Tuple


class StaleTokenError(Exception):
    """Raised when a write arrives with an outdated fencing token."""


class RedlockWithFencing:
    def __init__(self, redis_instances: list, ttl_ms: int = 10000):
        self.instances = [redis.Redis(host=h, port=p) for h, p in redis_instances]
        self.ttl_ms = ttl_ms
        self.quorum = len(self.instances) // 2 + 1
        self.clock_drift_factor = 0.01

    def acquire(self, resource: str, client_id: str) -> Optional[Tuple[int, float]]:
        """
        Acquire lock and return (fencing_token, validity_ms) or None.

        The fencing token is the maximum lock version across all instances.
        This provides weak monotonicity - not perfect, but better than nothing.
        """
        start_time = time.time() * 1000
        acquired_count = 0
        max_version = 0

        for instance in self.instances:
            try:
                # Use INCR to generate a version, then SET with that version
                version = instance.incr(f"{resource}:version")
                result = instance.set(
                    resource,
                    f"{client_id}:{version}",
                    nx=True,
                    px=self.ttl_ms
                )
                if result:
                    acquired_count += 1
                    max_version = max(max_version, version)
            except redis.RedisError:
                continue

        elapsed = time.time() * 1000 - start_time
        validity = self.ttl_ms - elapsed - (self.ttl_ms * self.clock_drift_factor)

        if acquired_count >= self.quorum and validity > 0:
            return (max_version, validity)

        # Failed to acquire, release any partial locks
        self.release(resource, client_id)
        return None

    def release(self, resource: str, client_id: str):
        """Release lock on all instances."""
        release_script = """
        local val = redis.call('get', KEYS[1])
        if val and string.find(val, ARGV[1]) then
            return redis.call('del', KEYS[1])
        end
        return 0
        """
        for instance in self.instances:
            try:
                instance.eval(release_script, 1, resource, client_id)
            except redis.RedisError:
                continue


# Protected resource with fencing
class FencedDatabase:
    def __init__(self):
        self.highest_token = 0

    def update(self, fencing_token: int, data: dict) -> bool:
        if fencing_token < self.highest_token:
            raise StaleTokenError(
                f"Token {fencing_token} < current {self.highest_token}"
            )
        self.highest_token = fencing_token
        # Proceed with update
        self._commit(data)
        return True

    def _commit(self, data: dict):
        # Placeholder for the actual storage write
        pass
```

Limitations of Fencing with Redlock:

- The token is only weakly monotonic: each instance keeps its own `resource:version` counter, so counters can diverge, and an instance that restarts without persistence can hand out a lower number than one issued earlier.
- It only helps if the protected resource can check and record tokens; in that case the real safety comes from the resource's check rather than from Redlock, as the critique below points out.
"If you use fencing tokens, Redlock becomes an optimization to reduce the number of times you get rejected by the resource. The actual safety comes from fencing, not from Redlock. At that point, a single Redis instance provides the same optimization with less complexity."
Despite the controversy, Redlock has its place. The key is matching the tool to the requirement.
Decision Framework:
| Use Case | Recommended Solution | Why |
|---|---|---|
| Cache warming coordination | Single Redis SETNX | Simple, low stakes |
| Prevent duplicate batch jobs | Single Redis or Redlock | Duplicate is wasteful but not catastrophic |
| Rate limiting coordination | Redis with Lua scripts | Approximate limits are fine |
| Inventory management | Zookeeper/etcd + fencing | Data integrity critical |
| Payment processing | Zookeeper/etcd + fencing | Financial correctness required |
| Database leader election | Zookeeper/etcd | Single leader mandatory |
| Distributed cron | Zookeeper/etcd or Redlock | Depends on job idempotency |
| File exclusive access | Zookeeper/etcd | Partial writes cause corruption |
For efficiency locks, a single Redis instance may be simpler than Redlock and just as effective. The additional complexity of Redlock (5 instances, quorum logic, validity calculation) provides marginal benefit for cases where occasional double-granting is acceptable anyway. And if you truly need that marginal improvement, you probably need a consensus system instead.
If you decide to use Redlock, use a well-tested client library rather than implementing the algorithm yourself.
Official and Popular Implementations:
```javascript
const Redlock = require('redlock');
const Redis = require('ioredis');

// Connect to 5 independent Redis instances
const redisClients = [
  new Redis({ host: 'redis1', port: 6379 }),
  new Redis({ host: 'redis2', port: 6379 }),
  new Redis({ host: 'redis3', port: 6379 }),
  new Redis({ host: 'redis4', port: 6379 }),
  new Redis({ host: 'redis5', port: 6379 }),
];

const redlock = new Redlock(redisClients, {
  // Time in ms before retrying
  retryDelay: 200,
  // Max attempts to acquire lock
  retryCount: 10,
  // Clock drift factor (0.01 = 1%)
  driftFactor: 0.01,
});

async function doExclusiveWork() {
  try {
    // Acquire lock for 10 seconds
    const lock = await redlock.acquire(['locks:my-resource'], 10000);
    console.log('Lock acquired:', lock.value);

    try {
      // CRITICAL SECTION
      // Use lock.expiration to check remaining time
      console.log('Expiration:', lock.expiration);
      await processExclusiveOperation();
    } finally {
      // Release lock
      await lock.release();
      console.log('Lock released');
    }
  } catch (error) {
    if (error.name === 'LockError') {
      console.log('Could not acquire lock:', error.message);
    } else {
      throw error;
    }
  }
}

// Auto-extending locks (for long operations)
async function doLongWork() {
  const lock = await redlock.acquire(['locks:long-operation'], 10000);

  // Extend lock every 5 seconds
  const extendInterval = setInterval(async () => {
    try {
      await lock.extend(10000);
      console.log('Lock extended');
    } catch (e) {
      console.error('Failed to extend lock!');
      clearInterval(extendInterval);
    }
  }, 5000);

  try {
    await veryLongOperation();
  } finally {
    clearInterval(extendInterval);
    await lock.release();
  }
}
```

Simpler Alternatives to Redlock:
If you're questioning whether Redlock is right for you, consider these alternatives:
| Alternative | When to Use | Trade-off |
|---|---|---|
| Single Redis SETNX | Efficiency locks, can tolerate Redis failure | Single point of failure |
| Redis Sentinel + SETNX | Efficiency locks with failover | Async replication can lose locks |
| PostgreSQL advisory locks | Already using PostgreSQL, single-process critical sections | Database as coordination service |
| etcd via K8s | Running Kubernetes | Adds dependency, but battle-tested |
| AWS DynamoDB locks | AWS environment | Managed service, conditional writes |
| GCP Spanner locks | GCP environment | External consistency, higher latency |
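As an illustration of the PostgreSQL advisory-lock row in the table above, here is a minimal sketch assuming the psycopg2 driver; the DSN and lock key are placeholders. Advisory locks are held by the database session, so they are released automatically if the connection drops.

```python
import psycopg2

LOCK_KEY = 42  # application-chosen 64-bit integer identifying the resource

def run_exclusively(dsn: str, do_work) -> bool:
    """Run do_work() only if we win the advisory lock; return False otherwise."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # Session-level, non-blocking advisory lock: returns true/false immediately
            cur.execute("SELECT pg_try_advisory_lock(%s)", (LOCK_KEY,))
            if not cur.fetchone()[0]:
                return False  # someone else holds the lock
            try:
                do_work()
            finally:
                cur.execute("SELECT pg_advisory_unlock(%s)", (LOCK_KEY,))
        return True
    finally:
        conn.close()

# Example usage (hypothetical DSN):
# run_exclusively("dbname=app user=app", lambda: print("critical section"))
```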
We've explored Redlock in depth—its algorithm, limitations, the famous debate, and practical guidance. Let's consolidate the key insights:
The Bottom Line:
Redlock occupies an uncomfortable middle ground:

- It is more complex to operate than a single Redis instance (five independent masters, quorum logic, validity arithmetic), yet it lacks the asynchronous-model safety guarantees of a consensus system.
- It is heavier than necessary for efficiency locks, and not safe enough for correctness locks.
For most engineers, the recommendation is clear:

- For efficiency locks, use a single Redis instance with SET NX PX and a Lua release script.
- For correctness locks, use Zookeeper or etcd with fencing tokens.
- Reach for Redlock only if you have examined its timing assumptions and are confident they hold in your environment.
You have completed the Distributed Locking module. You now understand why distributed locks are needed, the formal properties they must satisfy, how Zookeeper and etcd implement correct locks, and the trade-offs of the Redis Redlock algorithm. You're equipped to make informed decisions about distributed locking in your systems—choosing the right tool based on whether you need efficiency or correctness, and understanding the failure modes of each approach.
Recommended Reading:

- Martin Kleppmann, "How to do distributed locking" (2016)
- Salvatore Sanfilippo (antirez), "Is Redlock safe?" (2016)
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Chapter 8, "The Trouble with Distributed Systems"