We've explored the theoretical foundations of leader election—Bully, Ring, and Lease-based algorithms—but theory only takes us so far. Production systems face constraints that textbooks rarely address: legacy compatibility, operational complexity, failure correlation, and the cold reality that any system you deploy will eventually fail in ways you didn't anticipate.
This page bridges theory and practice by examining how real-world systems implement leader election. From PostgreSQL's streaming replication to Kubernetes' controller architecture, from Kafka's partition leadership to etcd's Raft consensus, we'll see how production systems adapt, extend, and sometimes deviate from theoretical ideals.
More importantly, we'll extract practical lessons about choosing and operating leader election mechanisms—lessons that only emerge from running these systems at scale.
By the end of this page, you will understand how major production systems implement leader election, the operational considerations that influence mechanism choice, common failure patterns and how to prevent them, monitoring and observability requirements, and practical guidance for choosing leader election approaches for your systems.
Databases were among the first systems to implement distributed leader election because the primary-replica pattern fundamentally requires a single write coordinator. Different database systems take remarkably different approaches, reflecting their design priorities and historical evolution.
PostgreSQL with Patroni:
PostgreSQL itself doesn't include leader election—it provides streaming replication between a primary and replicas. The Patroni project adds cluster management with leader election using external coordination services.
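Conceptually, each Patroni agent runs a control loop: it tries to acquire or renew a leader key with a TTL in the DCS, then reconciles the local PostgreSQL role with the result. Below is a rough Go sketch of that pattern only; the DCS and Postgres interfaces are hypothetical stand-ins, not Patroni's actual code.

package example

import (
	"context"
	"log"
	"time"
)

// DCS is a hypothetical abstraction over the external store (etcd, Consul, Zookeeper).
type DCS interface {
	// AcquireOrRenew returns true if this node now holds the leader key.
	AcquireOrRenew(ctx context.Context, nodeID string, ttl time.Duration) (bool, error)
}

// Postgres is a hypothetical abstraction over local role management.
type Postgres interface {
	IsPrimary() bool
	Promote() error // e.g. wraps "pg_ctl promote"
	Demote() error  // e.g. restart as a replica, possibly after pg_rewind
}

// runLoop keeps the local database role consistent with DCS leadership.
func runLoop(ctx context.Context, dcs DCS, pg Postgres, nodeID string) {
	const ttl = 30 * time.Second
	ticker := time.NewTicker(10 * time.Second) // loop interval well below the TTL
	defer ticker.Stop()
	for {
		isLeader, err := dcs.AcquireOrRenew(ctx, nodeID, ttl)
		if err != nil {
			// If the DCS is unreachable we cannot prove we still hold the lease,
			// so treat ourselves as a non-leader and step down to avoid split-brain.
			isLeader = false
		}
		switch {
		case isLeader && !pg.IsPrimary():
			if err := pg.Promote(); err != nil {
				log.Printf("promotion failed: %v", err)
			}
		case !isLeader && pg.IsPrimary():
			if err := pg.Demote(); err != nil {
				log.Printf("demotion failed: %v", err)
			}
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}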
Failover promotes a replica with pg_ctl promote, and pg_rewind is used to rejoin the demoted primary and handle split-brain scenarios during promotion. Key design decisions: leadership is held as a lease (a leader key with a TTL) in an external DCS such as etcd, Consul, or Zookeeper, and failure-detection timeouts are deliberately conservative, trading slower failover for fewer spurious promotions.
MySQL with Group Replication:
MySQL Group Replication (GR) takes a different approach—it embeds Paxos-based consensus directly into the database:
Key design decisions: consensus is embedded in the database itself, so no external coordination service is needed, and the group automatically selects a new primary once a failure is detected.
CockroachDB and TiDB:
NewSQL databases like CockroachDB and TiDB use Raft consensus for both data replication and leader election:
| Database | Election Mechanism | External Service | Failover Time |
|---|---|---|---|
| PostgreSQL + Patroni | Lease-based | etcd/Consul/ZK | 10-30 seconds |
| MySQL Group Rep. | Embedded Paxos | None | 5-15 seconds |
| CockroachDB | Embedded Raft | None | 1-5 seconds |
| MongoDB | Raft-based | None | 10-12 seconds |
| Redis Sentinel | Custom quorum | Redis Sentinel | 10-30 seconds |
Faster failover often means more aggressive failure detection, which increases false positive risk. PostgreSQL/Patroni's slower failover is a conscious trade-off for fewer spurious failovers. Consider your tolerance for brief unavailability vs risk of unnecessary failover churn when tuning these systems.
Coordination services like Zookeeper, etcd, and Consul provide leader election as a primitive that other systems can use. But these services themselves need internal leader election—creating an interesting recursive problem.
Apache Zookeeper:
Zookeeper uses the ZAB (Zookeeper Atomic Broadcast) protocol for consensus and leader election:
ZAB election process: servers exchange votes containing their last-seen epoch and transaction id (zxid); the server with the most up-to-date history (highest epoch, then highest zxid, then highest server id as a tie-breaker) wins, and followers synchronize with the new leader before it starts serving.
Typical failover time: 2-10 seconds (configurable via tickTime and syncLimit)
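ZAB governs Zookeeper's own internal election; applications that use Zookeeper to elect their own leader typically rely on the classic ephemeral-sequential-znode recipe instead. A minimal sketch using the github.com/go-zookeeper/zk client is shown below (server addresses and paths are illustrative, and the parent election znode is assumed to already exist):

package example

import (
	"fmt"
	"sort"
	"time"

	"github.com/go-zookeeper/zk"
)

// electLeader implements the ephemeral-sequential-znode recipe: every candidate
// creates an ephemeral, sequenced znode under electionPath, and whichever
// candidate owns the lowest sequence number is the leader.
func electLeader(servers []string, electionPath string, nodeData string) error {
	conn, _, err := zk.Connect(servers, 5*time.Second)
	if err != nil {
		return err
	}
	// Ephemeral: removed automatically when this session expires.
	// Sequence: Zookeeper appends a monotonically increasing suffix.
	me, err := conn.Create(electionPath+"/candidate-", []byte(nodeData),
		zk.FlagEphemeral|zk.FlagSequence, zk.WorldACL(zk.PermAll))
	if err != nil {
		return err
	}
	for {
		children, _, err := conn.Children(electionPath)
		if err != nil {
			return err
		}
		sort.Strings(children)
		if electionPath+"/"+children[0] == me {
			fmt.Println("acquired leadership")
			return nil // keep the session (conn) alive while doing leader work
		}
		// Not the leader yet: wait for membership to change, then re-check.
		// (A production recipe watches only the next-lower znode to avoid herds.)
		_, _, events, err := conn.ChildrenW(electionPath)
		if err != nil {
			return err
		}
		<-events
	}
}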
etcd:
etcd implements Raft consensus, which includes leader election as a core component:
Raft election process: a follower that stops receiving heartbeats from the leader waits out a randomized election timeout, increments its term, becomes a candidate, and requests votes from its peers; the first candidate to collect votes from a majority becomes the new leader.
Typical failover time: 1-3 seconds
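Besides running Raft internally, etcd exposes leader election as a client-facing primitive through its concurrency package. A minimal sketch of campaigning for leadership with the go.etcd.io/etcd/client/v3 library (endpoints and the key prefix are illustrative):

package example

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func campaign(ctx context.Context, nodeID string) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	// The session is backed by a lease: if this process dies, the lease expires
	// and the election key is released automatically.
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		return err
	}
	defer session.Close()

	election := concurrency.NewElection(session, "/my-service/leader")

	// Campaign blocks until this node becomes leader (or ctx is cancelled).
	if err := election.Campaign(ctx, nodeID); err != nil {
		return err
	}
	log.Printf("%s is now the leader", nodeID)

	// ... leader-only work goes here ...

	// Resign hands off leadership gracefully instead of waiting for TTL expiry.
	return election.Resign(context.Background())
}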
HashiCorp Consul:
Consul also uses Raft consensus for its server nodes:
| Service | Algorithm | Failure Detection | Split-Brain Prevention |
|---|---|---|---|
| Zookeeper | ZAB (Paxos-like) | Heartbeat + session timeout | Epoch numbers + quorum |
| etcd | Raft | Heartbeat timeout | Term numbers + quorum |
| Consul | Raft | Heartbeat timeout | Term numbers + quorum |
Coordination services need leader election to function, but other systems use coordination services for leader election. If the coordination service fails, dependent systems cannot elect leaders. This creates a dependency hierarchy: coordination services must be your most reliable infrastructure. Run them on dedicated, stable nodes with excellent monitoring.
Container orchestration platforms like Kubernetes manage thousands of workloads across potentially thousands of nodes. Leader election is essential for ensuring exactly one controller processes each type of resource.
Kubernetes Control Plane Components:
Kubernetes runs multiple replicas of control plane components for high availability. Each component type uses leader election to ensure only one is active:
Kubernetes leader election implementation:
Kubernetes uses a lease-based approach built on the Lease resource (older releases used annotations on ConfigMap or Endpoints objects):
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  holderIdentity: scheduler-replica-1
  leaseDurationSeconds: 15
  renewTime: "2024-01-15T10:00:00Z"
  acquireTime: "2024-01-15T09:00:00Z"
Application-level leader election in Kubernetes:
Applications running in Kubernetes can also use leader election. The Kubernetes client libraries provide leader election primitives:
// Go client-go example
import (
    "context"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runLeaderElection(ctx context.Context, client kubernetes.Interface) {
    // In a pod, the hostname is the pod name, which makes a convenient unique identity.
    identity, _ := os.Hostname()

    lock := &resourcelock.LeaseLock{
        LeaseMeta: metav1.ObjectMeta{
            Name:      "my-app-lock",
            Namespace: "default",
        },
        Client: client.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{
            Identity: identity,
        },
    }

    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:          lock,
        LeaseDuration: 15 * time.Second, // how long a lease is valid without renewal
        RenewDeadline: 10 * time.Second, // the leader must renew within this window
        RetryPeriod:   2 * time.Second,  // how often candidates retry acquisition
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                // Start doing leader work
            },
            OnStoppedLeading: func() {
                // Clean up, stop processing
            },
        },
    })
}
Kubernetes leader election depends on the API server (which depends on etcd). If etcd is unavailable, leader election cannot proceed. In HA setups, ensure etcd has sufficient redundancy (at least 3 nodes). Also note that leader election traffic adds load to etcd; for applications with many replicas, consider longer lease durations to reduce write frequency.
Streaming platforms like Apache Kafka handle massive data volumes and require leader election for both partition leadership and cluster coordination. Their approach to leader election demonstrates sophisticated production requirements.
Apache Kafka:
Kafka has two levels of leadership:
1. Partition Leadership: every partition has a single leader replica that handles all produce and fetch traffic, while follower replicas copy its log.
2. Controller Leadership: one broker acts as the cluster controller, tracking broker liveness and deciding which replica leads each partition.
Traditional Kafka (with Zookeeper): the controller is whichever broker first creates the ephemeral /controller znode; if that broker dies, the znode disappears and the remaining brokers race to recreate it.
KRaft mode (Kafka Raft Metadata mode): a quorum of controller nodes keeps cluster metadata in an internal Raft log, and the Raft leader of that quorum acts as the active controller, so no Zookeeper is required.
Partition leader election in Kafka: when a partition leader fails, the controller picks a new leader from the partition's in-sync replica set (ISR) and propagates the change to brokers and clients.
Apache Pulsar:
Pulsar uses a different architecture with separate serving and storage layers: stateless brokers own and serve topics, while the data itself is stored durably in Apache BookKeeper.
Key insight: By separating compute (brokers) from storage (BookKeeper), Pulsar makes 'leadership' failover nearly instantaneous—a new broker simply starts serving requests from the existing storage.
| System | Leadership Scope | Election Mechanism | Failover Impact |
|---|---|---|---|
| Kafka (ZK) | Partition + Controller | Zookeeper ephemeral nodes | Partition unavailable during election |
| Kafka (KRaft) | Partition + Controller | Raft consensus | Faster, no ZK dependency |
| Pulsar | Topic ownership | Zookeeper coordination | Near-instant (stateless brokers) |
| RabbitMQ | Queue leader | Raft (Quorum Queues) | Queue unavailable briefly |
In Kafka, if a partition has replication factor 1 (no follower replicas), leader failure means data loss and extended unavailability. With replication factor 3, the controller can immediately promote an in-sync follower from the ISR. Always use a replication factor >= 3 for important topics, and configure min.insync.replicas to prevent accepting writes that can't be durably replicated.
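As a concrete sketch of those settings, here is how a topic with replication factor 3 and min.insync.replicas=2 might be created with the sarama Go client, alongside a producer that requires acknowledgement from all in-sync replicas (broker addresses, topic name, and partition count are illustrative):

package example

import (
	"github.com/IBM/sarama"
)

func createDurableTopic(brokers []string) error {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_0_0_0 // the admin API needs an explicit protocol version

	admin, err := sarama.NewClusterAdmin(brokers, cfg)
	if err != nil {
		return err
	}
	defer admin.Close()

	minISR := "2"
	// Replication factor 3: the partition survives a broker failure, and the
	// controller can promote an in-sync follower without data loss.
	err = admin.CreateTopic("orders", &sarama.TopicDetail{
		NumPartitions:     6,
		ReplicationFactor: 3,
		ConfigEntries: map[string]*string{
			"min.insync.replicas": &minISR,
		},
	}, false)
	if err != nil {
		return err
	}

	// Producers should require acks from all in-sync replicas so that a write is
	// only acknowledged once it is durable on at least min.insync.replicas brokers.
	producerCfg := sarama.NewConfig()
	producerCfg.Producer.RequiredAcks = sarama.WaitForAll
	producerCfg.Producer.Return.Successes = true
	producer, err := sarama.NewSyncProducer(brokers, producerCfg)
	if err != nil {
		return err
	}
	defer producer.Close()
	// ... produce messages with producer.SendMessage(...) ...
	return nil
}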
Operating leader election systems at scale reveals challenges that aren't obvious from theory. These operational lessons come from running production systems through failures, outages, and edge cases.
Maintenance windows and planned failovers:
Production systems need maintenance. Graceful leadership handoff is essential: release leadership explicitly (or trigger a planned switchover, for example with patronictl switchover in Patroni) before restarting a node, rather than letting the lease expire and forcing an unplanned election.
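For example, with the etcd election API shown earlier, a node can resign explicitly when it receives SIGTERM at the start of a maintenance window, so followers take over immediately instead of waiting for the lease TTL to expire (a sketch; election is the concurrency.Election value obtained when campaigning):

package example

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"

	"go.etcd.io/etcd/client/v3/concurrency"
)

// resignOnShutdown blocks until the process is asked to stop, then releases
// leadership explicitly so another node can take over right away.
func resignOnShutdown(election *concurrency.Election, stopLeaderWork func()) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, os.Interrupt)
	<-sigs

	// Stop leader-only work before giving up the key, so two nodes never act
	// as leader at the same time.
	stopLeaderWork()

	if err := election.Resign(context.Background()); err != nil {
		// If resigning fails (e.g. etcd is unreachable), the lease will still
		// expire on its own; the handoff is just slower.
		log.Printf("resign failed: %v", err)
	}
}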
Node placement for availability:
Distribute leader election participants across failure domains: if all quorum members sit in the same rack, availability zone, or power domain, a single correlated failure can eliminate the majority and leave the cluster unable to elect any leader.
When a coordination service recovers from an outage, all waiting clients may simultaneously attempt to acquire leadership. This 'thundering herd' can overwhelm the freshly-recovered service. Implement jittered backoff: each client waits a random delay before attempting acquisition. Most coordination client libraries handle this, but verify your implementation.
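A minimal sketch of jittered backoff around acquisition attempts (tryAcquire is a placeholder for whatever acquisition call your coordination client exposes):

package example

import (
	"math/rand"
	"time"
)

// acquireWithBackoff retries leadership acquisition with exponential backoff and
// full jitter, so candidates that all woke up at the same moment do not retry
// against the coordination service in lockstep.
func acquireWithBackoff(tryAcquire func() bool) {
	window := 1 * time.Second
	const maxWindow = 30 * time.Second
	for !tryAcquire() {
		// Sleep a random duration in [0, window) rather than the full window,
		// which spreads retries from different nodes across time.
		time.Sleep(time.Duration(rand.Int63n(int64(window))))
		window *= 2
		if window > maxWindow {
			window = maxWindow
		}
	}
}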
Effective monitoring is essential for operating leader election systems. You need to know when elections happen, how long they take, and when the system is at risk of unnecessary failover.
| Metric | Description | Alert Threshold |
|---|---|---|
| leader_election_count | Number of elections (per time window) | > 3 per hour may indicate instability |
| leader_election_duration | Time from election start to completion | > 2x expected duration |
| lease_renewal_latency | Time to renew the leadership lease | > 50% of lease duration |
| time_since_last_heartbeat | Follower's view of leader liveness | Approaching timeout threshold |
| current_leader | Identity of current leader | Undefined/null indicates no leader |
| split_brain_detected | Multiple nodes claiming leadership | Any occurrence is critical |
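If your application implements its own leader election, it needs to export these metrics itself. A minimal sketch with the Prometheus Go client (metric names mirror the table above and are illustrative):

package example

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	isLeader = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "leader_election_is_leader",
		Help: "1 if this replica currently holds leadership, 0 otherwise.",
	})
	electionCount = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "leader_election_count",
		Help: "Number of leadership transitions observed by this replica.",
	})
	renewalLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "lease_renewal_latency_seconds",
		Help:    "Time taken to renew the leadership lease.",
		Buckets: prometheus.DefBuckets,
	})
)

func init() {
	prometheus.MustRegister(isLeader, electionCount, renewalLatency)
}

// Hook these into your election callbacks and renewal loop.
func onStartedLeading()              { isLeader.Set(1); electionCount.Inc() }
func onStoppedLeading()              { isLeader.Set(0) }
func observeRenewal(d time.Duration) { renewalLatency.Observe(d.Seconds()) }

// serveMetrics exposes /metrics for Prometheus to scrape.
func serveMetrics() error {
	http.Handle("/metrics", promhttp.Handler())
	return http.ListenAndServe(":9090", nil)
}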
Example monitoring setup for Patroni (PostgreSQL):
# Prometheus alerting rules for Patroni
groups:
- name: patroni-leader-election
  rules:
  - alert: PostgresLeaderUnknown
    expr: patroni_leader == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "No PostgreSQL leader detected"
  - alert: TooManyLeaderElections
    expr: increase(patroni_failovers_total[1h]) > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Excessive PostgreSQL failovers"
  - alert: LeaderLeaseExpiringSoon
    expr: patroni_leader_ttl_seconds < 5
    for: 10s
    labels:
      severity: warning
    annotations:
      summary: "PostgreSQL leader lease almost expired"
Every alert should link to a runbook explaining what the alert means and how to respond. For leader election alerts, runbooks should cover: how to identify the current leader, how to trigger manual failover, how to investigate election failures, and when to escalate.
When designing a new system that requires leader election, how do you choose an approach? Here's a practical decision framework based on system characteristics and requirements.
Decision tree:
Q1: Do you already have a coordination service (etcd, Zookeeper, Consul)?
Q2: How critical is split-brain prevention?
Q3: What's your availability requirement?
Q4: What's your team's expertise?
Q5: What's your scaling requirement?
Implementing leader election correctly is surprisingly difficult. Edge cases around network partitions, clock skew, and process pauses are subtle. Unless you have strong requirements for embedded consensus, using a well-tested coordination service is almost always the right choice. The operational cost of debugging a buggy leader election implementation far exceeds the cost of running an additional service.
We've explored how production systems implement leader election, learning lessons that only emerge from operating these systems at scale. Let's consolidate the key takeaways: production systems overwhelmingly rely on lease-based or Raft-style election rather than the classic Bully and Ring algorithms; faster failover trades against a higher risk of spurious failovers; coordination services sit at the bottom of the dependency hierarchy and must be your most reliable infrastructure; and monitoring, runbooks, and graceful handoff procedures matter as much as the algorithm itself.
Module complete:
You've now journeyed through the complete landscape of leader election—from the fundamental need for coordination, through classic algorithms (Bully, Ring), to modern lease-based approaches, and finally to production implementations. This knowledge equips you to design, implement, and operate distributed systems that require single-leader coordination, and to evaluate the trade-offs inherent in different approaches.
Congratulations! You've mastered leader election in distributed systems. You understand when leader election is needed, how classic algorithms work, why lease-based approaches dominate production, and how real systems implement and operate leader election. This knowledge is foundational for designing reliable distributed systems at scale.