In 2012, a major e-commerce platform experienced a catastrophic inventory management failure during Black Friday. The system kept reporting items as "in stock" and accepted thousands of orders after the actual inventory had long been depleted. The result: $47 million in over-sold merchandise, angry customers, and a logistics nightmare that took weeks to untangle.
The root cause? Multiple order processing servers simultaneously read the same inventory count, each believed items were available, and each deducted from the count independently. Without proper coordination, hundreds of concurrent purchases competed for the same limited inventory—and every one of them won, at least from the system's perspective.
This is the distributed locking problem in its most visceral form: when multiple processes across different machines need to safely access and modify shared resources, the absence of coordination leads to corruption.
By the end of this page, you will understand why distributed locking is not just a theoretical concern but a fundamental requirement for building reliable distributed systems. You'll grasp the difference between local and distributed coordination, recognize the scenarios that demand distributed locks, and appreciate the inherent challenges that make this problem so difficult to solve correctly.
Before understanding distributed locks, we must first appreciate how coordination works in single-machine environments—and why those mechanisms catastrophically fail when we scale beyond a single server.
The Single-Machine Golden Age:
In a single-machine environment, coordination is relatively straightforward. Operating systems provide robust primitives for synchronization:

- Mutexes and spinlocks for mutual exclusion within and across processes
- Semaphores for limiting concurrent access to a pool of resources
- Condition variables for waiting on state changes
- Atomic CPU instructions (compare-and-swap, fetch-and-add) for lock-free updates
These primitives work because they rely on shared memory and a single source of truth. The operating system kernel arbitrates access, and all threads agree on the state of the world because they're all looking at the same memory.
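To make this concrete, here is a minimal single-machine sketch using Python's threading.Lock. Because all four threads share one process's memory and the kernel arbitrates entry into the critical section, no increments are lost.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def increment_many(times: int) -> None:
    global counter
    for _ in range(times):
        with counter_lock:   # mutual exclusion, arbitrated by the OS
            counter += 1     # this read-modify-write is now safe

threads = [threading.Thread(target=increment_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # Always 400000: no lost updates on a single machine
```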
Why This Breaks in Distributed Systems:
When your application spans multiple servers, the fundamental assumptions collapse:
| Aspect | Single Machine | Distributed System |
|---|---|---|
| Memory | Shared by all threads | Private to each server—no shared memory |
| Time | Single clock, consistent view | Multiple clocks, clock skew, no agreed 'now' |
| Failure Model | Process crash = all-or-nothing | Partial failures: some nodes up, some down |
| Communication | Function calls, nanoseconds | Network messages, milliseconds, may fail |
| Ordering | Total ordering via OS scheduler | No inherent ordering of events across nodes |
| Arbitration | Kernel decides who gets lock | No single arbiter—must coordinate ourselves |
In a distributed system, there is no shared memory, no global clock, and no central arbiter. Yet we still need processes on different machines to coordinate access to shared resources. This is the distributed locking problem: achieving mutual exclusion without the primitives that make local mutual exclusion straightforward.
Race conditions in distributed systems are particularly insidious because they're often non-deterministic and hard to reproduce. A system might run perfectly for months, then fail spectacularly under slightly higher load or slightly different timing.
Anatomy of a Distributed Race Condition:
Consider a simple read-modify-write operation like incrementing a counter. In a single-threaded world, this is trivial:
```python
# Single-threaded: No problem
counter = get_counter_from_database()   # Read: counter = 100
counter = counter + 1                   # Modify: counter = 101
save_counter_to_database(counter)       # Write: saved 101
```

Now imagine two servers (A and B) execute this code simultaneously for the same counter:
```
Time   Server A                   Server B                   Database
----   --------                   --------                   --------
t1     read counter → 100         (waiting)                  counter = 100
t2     (processing)               read counter → 100         counter = 100
t3     counter = 100 + 1 = 101    (processing)               counter = 100
t4     write counter = 101        counter = 100 + 1 = 101    counter = 101
t5     (done)                     write counter = 101        counter = 101

Result: Two increments, but counter only increased by 1.
Expected: counter = 102
Actual:   counter = 101
Lost update: 1 increment was silently dropped
```

The terrifying aspect of lost updates is that no error is raised. The system doesn't crash. Both servers believe they succeeded. The corruption accumulates silently until someone notices that numbers don't add up—often long after the transactions are irreversible.
Categories of Distributed Race Conditions:
Distributed race conditions manifest in several recurring patterns, each with its own characteristics:

- Lost updates: concurrent read-modify-write cycles silently overwrite each other, as in the counter example above.
- Duplicate execution: the same order, charge, or scheduled job is processed by more than one node.
- Split-brain: two nodes each believe they are the sole authority for a resource (for example, the leader of a replica set) and accept conflicting writes.
Distributed locking isn't an abstract computer science problem—it's a practical requirement across virtually every industry. Let's examine concrete scenarios where distributed locks are essential.
Scenario 1: E-Commerce Inventory Management
```python
# Flash sale: 100 units available, 10,000 customers competing
# Without distributed locking:

async def process_order(product_id, customer_id, quantity):
    # Multiple servers execute this concurrently
    available = await db.get_inventory(product_id)  # All see: 100 units
    if available >= quantity:
        # Hundreds of customers pass this check simultaneously
        await db.decrement_inventory(product_id, quantity)
        await create_order(customer_id, product_id, quantity)
        return "Order placed!"
    return "Out of stock"

# Result: 500 orders placed for 100 units of inventory
# Every customer was promised their order. Chaos ensues.
```

Scenario 2: Leader Election in Database Replicas
In a replicated database system, exactly one node must be the leader that accepts writes. If two nodes simultaneously believe they are the leader, split-brain occurs—both accept writes, and the data diverges irreconcilably.
```python
# Database replica cluster: nodes A, B, C
# Leader election without proper locking:

async def try_become_leader(node_id):
    current_leader = await check_leader_status()  # Network partition: stale data
    if current_leader is None or leader_seems_dead():
        await register_as_leader(node_id)  # Both A and B do this!
        return True
    return False

# Result: Both A and B think they are leader
# Client writes go to A: SET user:123:balance = 500
# Client writes go to B: SET user:123:balance = 750
# When partition heals: Which balance is correct? Data is corrupted.
```

Scenario 3: Scheduled Task Execution (Cron at Scale)
When you run a distributed application across multiple servers, scheduled tasks (like nightly billing runs) must execute exactly once—not once per server.
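One common way to get this guarantee is a "run once per day" claim backed by a unique constraint in a shared database. The sketch below is illustrative, not from the original text; sqlite3 stands in for the shared production database. Every server attempts the same INSERT, exactly one succeeds, and only that server runs the job.

```python
import sqlite3
from datetime import date

db = sqlite3.connect("jobs.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS job_runs (job TEXT, run_date TEXT, PRIMARY KEY (job, run_date))"
)

def claim_nightly_run(job: str) -> bool:
    try:
        # The primary key makes the claim atomic: only one INSERT per (job, day) can succeed
        db.execute(
            "INSERT INTO job_runs (job, run_date) VALUES (?, ?)",
            (job, date.today().isoformat()),
        )
        db.commit()
        return True       # this server won the claim and should run the job
    except sqlite3.IntegrityError:
        return False      # another server already claimed today's run

if claim_nightly_run("nightly-billing"):
    print("running billing")   # the real billing job would run here
else:
    print("skipped: already claimed by another server")
```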
| Scenario | Without Lock | With Lock | Business Impact |
|---|---|---|---|
| Payment Processing | Double charges | Single atomic charge | $M in refunds, trust loss |
| File Processing | Corrupt files from concurrent edits | Sequential access | Data integrity maintained |
| API Rate Limiting | Quota exceeded unfairly | Accurate global counts | Fair resource usage |
| Cache Warming | Thundering herd on cache miss | Single fetcher | Database protected |
| Workflow Orchestration | Duplicate task execution | Exactly-once processing | Reliable automation |
| Configuration Updates | Inconsistent config states | Atomic config rollout | System stability |
You need a distributed lock when: (1) Multiple processes across different machines access the same resource, (2) Operations are not idempotent (repeating them causes harm), and (3) Concurrent execution leads to an invalid state. If all three conditions are true, distributed locking is not optional—it's required for correctness.
When engineers first encounter distributed race conditions, the instinct is often to apply familiar solutions. Unfortunately, local concurrency solutions don't survive the transition to distributed systems.
Failed Approach 1: Database Row Locks
```sql
-- Attempt: Use SELECT ... FOR UPDATE to lock the row
BEGIN TRANSACTION;

SELECT inventory FROM products WHERE id = 'product_123' FOR UPDATE;
-- This locks the row, right?

-- Problem 1: Lock is held only during this transaction
-- If application crashes between read and commit, lock is released
-- but external systems (payment, shipping) may have already been notified

-- Problem 2: Connection pooling means different requests might
-- use different connections, and locks are connection-scoped

-- Problem 3: Long-running external calls while holding lock
-- leads to lock contention, timeouts, and eventually deadlocks

UPDATE products SET inventory = inventory - 1 WHERE id = 'product_123';
COMMIT;
```

Row-level locks work for pure-database operations but fail when:

- the critical section spans external systems (payment, shipping) whose side effects cannot be rolled back with the transaction,
- the lock must outlive a single transaction or connection, which connection pooling cannot guarantee, or
- long-running external calls are made while the row is locked, causing contention, timeouts, and eventually deadlocks.
Failed Approach 2: Optimistic Locking with Version Numbers
```python
import asyncio
import random

async def update_with_optimistic_lock(product_id, quantity):
    max_retries = 5

    for attempt in range(max_retries):
        # Read current value and version
        product = await db.get(product_id)
        current_version = product.version

        if product.inventory < quantity:
            return "Out of stock"

        # Attempt update with version check, bumping the version so
        # concurrent writers are detected
        result = await db.update(
            product_id,
            {"inventory": product.inventory - quantity,
             "version": current_version + 1},
            where={"version": current_version}  # Only if version unchanged
        )

        if result.modified_count == 1:
            return "Success"

        # Version changed, someone else modified it, retry with jitter
        await asyncio.sleep(random.uniform(0.01, 0.1))

    return "Too many conflicts, please try again"

# Problems:
# 1. Under high contention, most retries fail → wasted work
# 2. No fairness: early requests can starve
# 3. Doesn't prevent reads of stale data for business logic
# 4. Complexity explodes for multi-resource operations
```

Optimistic locking works well for low-contention scenarios but degrades rapidly when many processes compete for the same resource. In flash sale scenarios, retry storms can overwhelm the database.
Failed Approach 3: Application-Level Flags
```python
# Attempt: Use a shared cache to track who has the lock

async def acquire_lock(resource_id, holder_id):
    current_holder = await redis.get(f"lock:{resource_id}")
    if current_holder is None:
        # RACE CONDITION: Between this check and the set,
        # another process can also see None and set its own holder
        await redis.set(f"lock:{resource_id}", holder_id, ex=30)
        return True
    return False

# This is the EXACT same race condition we're trying to solve!
# The "lock" mechanism itself has a race condition.
# Two processes can both see None and both think they acquired the lock.
```

You cannot build a distributed lock using non-atomic operations. Any check-then-act sequence is vulnerable to the race condition it's trying to prevent. This is why distributed locks require atomic primitives provided by specialized coordination services.
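For contrast, here is a minimal sketch of the atomic version, assuming redis-py's asyncio client and a reachable Redis server. Redis's SET with NX and EX performs the existence check and the write as a single server-side operation, so the check-then-act gap disappears; as the next section shows, however, atomicity alone does not solve expiry or stale-holder problems.

```python
import redis.asyncio as aioredis

r = aioredis.Redis(host="localhost", port=6379)

async def acquire_lock(resource_id: str, holder_id: str) -> bool:
    # SET ... NX EX: write the holder only if the key does not already exist,
    # with a 30-second expiry, all as one atomic Redis command.
    return bool(await r.set(f"lock:{resource_id}", holder_id, nx=True, ex=30))
```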
Distributed locking is difficult because it must solve coordination in an environment fundamentally hostile to coordination. Let's examine the core challenges that make this problem so hard.
Challenge 1: Partial Failures and Network Partitions
In a single-machine system, failure is binary: either everything works, or the machine is off. In distributed systems, failure is partial: some nodes work, some fail, and communication between any pair may or may not succeed.
Consider this nightmare scenario:
```
Timeline:
t0: Process A acquires lock on Resource X (success)
t1: Process A starts critical section work
t2: Network partition isolates Process A from lock service
t3: Lock service cannot contact A, assumes A is dead
t4: Lock service grants lock to Process B (from its perspective, lock is free)
t5: Process B starts critical section work
t6: Network heals—both A and B are now in critical section simultaneously!

Both processes believe they legitimately hold the lock.
Both are modifying the shared resource.
Mutual exclusion has been violated.
```

Challenge 2: Clock Synchronization and Time-Based Expiry
Locks typically have expiration times to prevent deadlock when holders crash. But distributed systems have no shared clock:

- Each node's clock drifts and can jump when NTP corrects it, so nodes disagree on when a lease "really" expires.
- The lock service and the lock holder measure the TTL on different clocks.
- A process pause (garbage collection, VM suspension, swapping) can outlast the TTL, leaving a holder that still believes it owns an already-expired lock, as sketched below.
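The process-pause case is easy to reproduce even on one machine. The toy below is an illustration, not a real lock service: it models a TTL-based lock in memory, holder A pauses for longer than the TTL, the lock is handed to B, and A resumes none the wiser.

```python
import time

class TtlLock:
    """Toy in-memory lock with TTL-based expiry (illustration only)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, holder: str) -> bool:
        now = time.monotonic()
        if self.holder is None or now >= self.expires_at:
            self.holder, self.expires_at = holder, now + self.ttl
            return True
        return False

lock = TtlLock(ttl_seconds=0.1)

assert lock.acquire("A")   # A takes the lock and starts its critical section
time.sleep(0.2)            # A pauses (think: a long GC pause) past the TTL
assert lock.acquire("B")   # the lock is considered expired and handed to B
# A now resumes and keeps writing, still convinced it holds the lock:
# mutual exclusion is violated and no error is ever raised.
```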
Challenge 3: Consensus Requirements
For a distributed lock to be safe, all participants must agree on who holds the lock at any given time. This is fundamentally a consensus problem, which is provably impossible to solve deterministically in asynchronous systems with failures (FLP impossibility).
Practical lock services circumvent FLP by:

- assuming partial synchrony and using timeouts and heartbeats as imperfect failure detectors,
- running leader-based consensus protocols (Paxos, Raft, ZAB) that preserve safety at all times and provide liveness whenever the network behaves well for long enough, and
- accepting that, in the worst case, the service may stall and refuse to grant the lock rather than hand it to two holders.
Every distributed lock implementation makes trade-offs. Some prioritize safety (never granting lock to two holders) at the cost of availability. Others prioritize availability at the cost of occasionally violating mutual exclusion. Understanding these trade-offs is essential for choosing the right solution.
Given the impossibility of perfect distributed locks, how do sophisticated systems protect against safety violations? The key innovation is fencing tokens—a mechanism that allows the protected resource to detect and reject stale lock holders.
How Fencing Tokens Work:
When a lock is granted, it includes a monotonically increasing number (the fencing token). Each time the lock is acquired, the token increases. The protected resource tracks the highest token it has seen and rejects requests with lower tokens.
```python
# Lock Service
class DistributedLock:
    def __init__(self):
        self.current_token = 0

    def acquire(self, client_id):
        # ... consensus/election logic ...
        self.current_token += 1
        return LockGrant(token=self.current_token, holder=client_id)


# Protected Resource (e.g., a database)
class FencedResource:
    def __init__(self):
        self.highest_token_seen = 0

    def update(self, token, data):
        if token < self.highest_token_seen:
            raise StaleTokenError(f"Token {token} is older than {self.highest_token_seen}")

        self.highest_token_seen = token
        # Proceed with update
        return self._do_update(data)
```

```
Timeline with Fencing:
t0: Process A acquires lock, receives token=33
t1: Process A starts work, sends update with token=33 to resource
t2: Network partition isolates Process A
t3: Lock expires, Process B acquires lock, receives token=34
t4: Process B sends update with token=34 to resource
t5: Resource accepts B's update, records highest_token=34
t6: Network heals, Process A (delayed) sends update with token=33
t7: Resource REJECTS A's update: 33 < 34 (stale token)

Safety preserved! The resource detected the stale lock holder.
```

Fencing tokens don't prevent lock safety violations in the lock service—they prevent those violations from corrupting the protected resource. This is defense in depth: the lock provides mutual exclusion as a first line of defense, and fencing tokens provide a safety net when that defense fails.
Limitations of Fencing Tokens:
Fencing requires the protected resource to participate in the protocol:

- The resource must remember the highest token it has seen and reject lower ones, which requires support for conditional writes or compare-and-set.
- Every client code path that touches the resource must carry the current token along with the request.
- Many resources (plain file systems, message queues, third-party APIs) offer no way to check a token at all, so they cannot be fenced.
This is why distributed locks alone, without fencing, should be treated as efficiency optimizations rather than correctness guarantees.
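To show what "participation" means in practice, here is a minimal sketch of a resource that enforces the token with a conditional UPDATE; sqlite3 stands in for the shared datastore, and the token values echo the 33-versus-34 timeline above. A stale token simply matches zero rows, so the write is rejected without any coordination with the lock service.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER, last_token INTEGER)")
db.execute("INSERT INTO accounts VALUES ('user:123', 100, 0)")

def fenced_update(account_id: str, new_balance: int, token: int) -> bool:
    # The WHERE clause is the fence: only a token at least as high as the
    # highest one seen so far can modify the row, so a delayed write from a
    # stale holder matches zero rows and is silently rejected.
    cur = db.execute(
        "UPDATE accounts SET balance = ?, last_token = ? "
        "WHERE id = ? AND last_token <= ?",
        (new_balance, token, account_id, token),
    )
    db.commit()
    return cur.rowcount == 1

print(fenced_update("user:123", 500, token=34))  # True: token 34 accepted
print(fenced_update("user:123", 750, token=33))  # False: stale token 33 rejected
```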
Martin Kleppmann, in his famous analysis of distributed locks, draws a crucial distinction between two use cases:
Efficiency Locks (Best Effort):
These locks prevent duplicate work but can tolerate occasional violations. If two processes occasionally do the same work, the result is wasted resources, not data corruption.
Correctness Locks (Must Be Perfect):
These locks protect invariants where any violation causes data corruption or incorrect outcomes. Even a single violation is unacceptable.
For efficiency locks, simpler implementations (even Redis SETNX) may suffice. For correctness locks, you need either (1) a consensus-based lock service with proper failure handling, or (2) a lock combined with fencing tokens, or (3) alternative approaches like single-writer designs or CRDTs. Using an efficiency lock when you need a correctness lock is a recipe for silent data corruption.
| Aspect | Efficiency Lock | Correctness Lock |
|---|---|---|
| Violation consequence | Wasted resources | Data corruption |
| Acceptable failure rate | Rare violations tolerable (e.g., 1 in 10,000) | Never acceptable |
| Implementation complexity | Simple (Redis SETNX) | Complex (Zookeeper/etcd + fencing) |
| Performance overhead | Low | Higher (consensus round-trips) |
| Recovery from violation | Retry, self-healing | Manual intervention, auditing |
We've established why distributed locking is a critical requirement for reliable distributed systems. Let's consolidate the key insights:

- Local synchronization primitives depend on shared memory, a single clock, and a kernel arbiter, none of which exist across machines.
- Distributed race conditions such as lost updates, duplicate execution, and split-brain corrupt data silently, without raising errors.
- Familiar fixes (row locks, optimistic versioning, check-then-act flags) do not survive the transition to distributed systems.
- Partial failures, unsynchronized clocks, and the FLP result mean no distributed lock can be perfect; fencing tokens let the protected resource defend itself against stale holders.
- Distinguish efficiency locks from correctness locks, and match the implementation to the consequences of a violation.
What's Next:
Now that we understand why distributed locking is needed, we'll dive into the formal requirements that any distributed lock must satisfy. The next page explores mutual exclusion, deadlock freedom, and liveness guarantees—the properties that separate working lock implementations from broken ones.
You now understand the fundamental need for distributed locking: the collapse of local coordination in distributed environments, the race conditions that emerge, and the real-world scenarios where locks are essential. You also understand why this problem is hard and why perfect solutions don't exist. Next, we'll formalize what properties a distributed lock must have to be useful.