Loading content...
Imagine renting an apartment: you pay for exclusive access for a fixed period. When the lease expires, you must renew or vacate—the landlord can give the apartment to someone else. This simple, familiar concept revolutionizes leader election in distributed systems.
Lease-based election abandons the voting-based approaches of Bully and Ring in favor of time-bounded leadership grants. A leader doesn't win an election through message exchange with peers; instead, it acquires a time-limited lease from a coordination service. As long as the lease is valid, the holder is the rightful leader. When the lease expires—whether due to failure, partition, or intentional non-renewal—leadership becomes available for another node to claim.
This approach elegantly solves the split-brain problem that plagues traditional election algorithms. Even if a partitioned leader continues operating, its lease eventually expires, and it must stop acting as leader regardless of what it believes. Time becomes the ultimate arbiter of leadership validity.
By the end of this page, you will understand how leases grant time-bounded exclusive access, the critical role of clock assumptions, lease acquisition and renewal protocols, how leases inherently handle network partitions, the concept of fencing tokens for safety, and practical considerations when implementing lease-based leader election.
A lease is a time-limited grant of exclusive access to a resource or role. In the context of leader election, the 'resource' is leadership itself. Let's formalize the properties of a lease:
Core lease properties:
Exclusivity: At most one holder at any time. If node A holds a valid lease, node B cannot hold a valid lease for the same resource.
Time-bounded: Every lease has an explicit expiration time. After expiration, the lease is no longer valid.
Renewable: Before expiration, the holder can request an extension. Successful renewal extends the lease; failure to renew means the lease expires.
Revocable (in some systems): The lessor (coordination service) can revoke a lease early, though this is less common for leader election.
Non-transferable: A lease cannot be passed from one holder to another. When leadership changes, the new leader acquires a fresh lease.
| Aspect | Traditional Election | Lease-Based Election |
|---|---|---|
| Leadership grant mechanism | Peer voting/consensus | Time-limited lease acquisition |
| Split-brain prevention | Relies on partition detection | Automatic via lease expiration |
| Coordination service | Optional (embedded election) | Required (lease issuer) |
| Time assumptions | Timeouts for failure detection | Clock synchronization for lease validity |
| Leadership duration | Until failure detected | Until lease expires (bounded) |
| Recovery behavior | Election on failure detection | New lease acquisition after expiry |
Leases and distributed locks are closely related. A lock provides exclusive access; a lease adds time-bounding to that access. In leader election, the leader 'locks' the leadership role, but that lock automatically releases after the lease period. This is why lease-based systems are more robust to failures—a stuck or crashed leader automatically loses leadership rather than holding it indefinitely.
Lease-based systems make a critical assumption: clocks across the system are reasonably synchronized. This assumption deserves careful examination because violations can break safety guarantees.
The clock synchronization requirement:
When a lease server grants a 30-second lease, it expects:
If clocks drift significantly, these assumptions fail. Consider:
Designing for clock uncertainty:
Practical lease systems build in safety margins to handle clock drift:
Guard period approach:
Example safety calculation:
Lease duration: 30 seconds
Maximum clock drift: 200ms/second (extremely conservative)
Maximum drift over lease: 6 seconds
Guard period: max_drift + network_latency = 6s + 1s = 7s
Effective leadership period: 30s - 7s = 23 seconds
The holder should stop acting as leader 7 seconds before lease expiration to guarantee safety even under maximum drift.
Implications:
Production systems typically use NTP for clock synchronization, achieving drift under 1 second in well-managed environments.
NTP can cause sudden clock jumps when correcting large drifts. If a leader's clock jumps forward 10 seconds, it suddenly believes its lease has expired. If it jumps backward, it believes it has more time than reality. Systems should use clock sources that adjust gradually (like chrony's slew mode) rather than stepping. Also, monitor for clock anomalies and treat large jumps as critical alerts.
When a node wants to become the leader, it must acquire a lease. The acquisition protocol varies by implementation, but the core pattern is consistent across systems.
Basic acquisition flow:
Step 1: Check current lease state
Step 2: Attempt lease acquisition
Step 3: Confirmation and activation
Step 4: Begin renewal cycle
| Scenario | Outcome | Node Action |
|---|---|---|
| No current lease exists | Acquisition succeeds | Become leader, start renewals |
| Current lease expired | Acquisition succeeds | Become leader, start renewals |
| Current lease valid (other node) | Acquisition fails | Remain follower, wait for availability |
| Concurrent acquisition attempt | One succeeds, others fail | Winner becomes leader, losers retry |
| Coordination service unreachable | Acquisition fails | Cannot become leader, retry later |
Implementation with etcd:
In etcd (a popular coordination service), lease acquisition uses the LeaseGrant and Put operations:
1. Grant a lease: LeaseGrant(TTL=30s) → returns lease_id
2. Put with lease: Put(key='/leader', value='node-A', lease=lease_id)
3. The Put succeeds only if no current value exists (using If conditions)
4. The key exists as long as the lease is valid
5. When lease expires, key is automatically deleted
The atomicity of Put with conditions ensures only one node can acquire leadership at a time. The automatic key deletion on lease expiry ensures leadership becomes available when the leader fails to renew.
The key to safe lease acquisition is atomic compare-and-swap (CAS). The acquisition request says 'set leader to me IF leader is currently empty.' If two nodes race, both see 'empty' but only one's CAS succeeds. The other's fails because the condition 'leader is empty' is no longer true. This atomicity is provided by the coordination service.
Once a node acquires leadership, it must continuously renew its lease to maintain leadership. Renewal frequency and failure handling are crucial for system reliability.
Renewal timing:
The leader must renew before the lease expires, accounting for:
Rule of third: A common heuristic is to renew at 1/3 of the lease duration:
This gives ample time for transient failures while maintaining safety.
Renewal protocol:
while (isLeader) {
sleep(lease_duration / 3)
for attempt in 1..max_retries {
result = renewLease()
if (result == SUCCESS) {
break // renewal successful
}
sleep(retry_delay) // backoff before retry
}
if (all retries failed) {
// Cannot renew - stop being leader
stepDown()
return
}
}
If a leader becomes too busy (high load) to renew its lease in time, it loses leadership—even though it was actively serving requests. This can cause oscillation: node loses leadership due to load, load decreases, node reacquires leadership, load increases, repeat. Design renewal as a high-priority background task that isn't blocked by business logic.
Network partitions are the nemesis of distributed leader election. Lease-based election provides an elegant solution that neither Bully nor Ring can match: automatic leadership expiration through time.
Partition scenario:
Consider a cluster with leader A, followers B and C, and a coordination service (CS):
Before partition:
A (leader) ← → CS ← → B, C (followers)
During partition:
A (leader) | CS ← → B, C (followers)
↑
Network partition
A is isolated from the coordination service and other nodes.
Lease-based resolution:
Key insight: A stopped acting as leader before a new leader was elected. There was never a moment with two active leaders.
| Aspect | Traditional Election | Lease-Based |
|---|---|---|
| Both partitions elect leader? | Yes (split-brain) | No (lease prevents) |
| Old leader continues operating? | Yes, until timeout | No, stops at guard period |
| New leader election timing | After failure detection | After lease expiry |
| Guaranteed safety | No (partition-vulnerable) | Yes (time-bounded) |
| Partition healing | Requires conflict resolution | Seamless (one leader) |
Lease-based partition safety assumes clocks don't misbehave. If node A's clock runs extremely slow, it might believe its lease is valid when the coordination service has already granted a new lease to B. The guard period must account for maximum expected clock drift. In extreme cases (e.g., VM clock completely frozen), even lease-based systems can have split-brain. Fencing tokens (discussed next) provide an additional safety layer.
Even with leases, a subtle safety gap exists: a leader that experiences a long process pause (GC, swap, etc.) might resume and continue acting as leader after a new leader has been elected. Fencing tokens close this gap by providing external validation of leadership authority.
The pause problem:
So far, so good—A correctly identifies it's no longer leader. But what if:
Fencing tokens solve this:
A fencing token is a monotonically increasing identifier associated with each lease grant:
The storage system acts as a fence, blocking operations from stale leaders.
| Scenario | Without Fencing | With Fencing |
|---|---|---|
| Stale write after pause | Corrupts new data | Rejected by storage |
| Request arrives out of order | May succeed incorrectly | Rejected if stale token |
| Split-brain writes | Both succeed | Only highest token succeeds |
| Implementation complexity | Lower | Requires storage support |
Fencing only works if all external systems (databases, storage, APIs) validate tokens. A storage system that ignores tokens provides no protection. When designing with fencing tokens, ensure all state-mutating operations flow through token-validating systems. This is why systems like Chubby (Google's lock service) provide integrated fencing.
Implementing lease-based leader election correctly requires attention to numerous practical details. Let's examine the key considerations that separate robust implementations from fragile ones.
Lease-based systems are difficult to test because failures are time-dependent. Use chaos engineering techniques: inject network partitions (iptables), simulate clock skew, pause processes (SIGSTOP), kill coordination service nodes. Verify the system never has two leaders acting simultaneously. Netflix's Chaos Monkey and similar tools help automate this testing.
We've thoroughly explored lease-based leader election—a fundamentally different approach that uses time-bounded authority rather than peer voting. Let's consolidate the key takeaways:
What's next:
We'll conclude this module with Leader Election in Practice, examining how production systems—from databases like PostgreSQL and MySQL to distributed systems like Kafka and Kubernetes—implement leader election. We'll see how theoretical concepts translate to real-world deployments and understand the operational considerations that determine which approach to choose.
You now understand lease-based leader election—its elegant time-bounded approach, clock requirements, partition handling, and fencing mechanisms. This is the dominant approach in modern distributed systems due to its inherent partition safety. Next, we'll see how these concepts are applied in real production systems.