Before asking "ZooKeeper or etcd?", ask a more fundamental question: Do you need a dedicated coordination service at all?
Coordination services are powerful tools, but they're also infrastructure you must operate, monitor, and maintain. They introduce new failure modes, require expertise, and add complexity to your architecture. Sometimes they're essential. Sometimes they're overkill.
This page helps you make that determination — understanding when coordination services are the right tool and when simpler alternatives might serve better.
By the end of this page, you will understand the specific problems that require coordination services, the patterns where simpler solutions suffice, the anti-patterns that indicate misuse, and a decision framework for whether to adopt a coordination service.
At their core, coordination services solve problems that arise when multiple distributed processes need to agree on something. This "something" might be:
These problems are fundamentally hard in distributed systems because of the constraints imposed by the CAP theorem, network partitions, and asynchronous communication.
The Key Insight: Consensus is Unavoidable — But Maybe You Don't Need It
If your problem genuinely requires consensus (e.g., exactly one leader at a time, true mutual exclusion, linearizable reads), you cannot escape it. You'll either use a coordination service that provides consensus or build a bespoke consensus mechanism, which is much harder to get right.
However, many problems that seem to require consensus actually don't. They can be solved with weaker guarantees that are much cheaper to provide. Recognizing the difference is crucial for architectural efficiency.
The litmus test for needing true consensus: what happens if your coordination fails and two nodes both believe they're the leader (or both hold the lock)? If the answer is 'data corruption' or 'duplicate financial transactions', you need robust coordination. If the answer is 'some duplicate work gets done', you might not.
Some patterns genuinely require the guarantees that only a coordination service (or equivalent consensus system) can provide. When you encounter these patterns, investing in proper coordination is not premature optimization — it's essential correctness.
Pattern: Only one node may write to a resource or make decisions at any time.
Examples:
Why Coordination is Required:
Alternative Evaluation:
Before reaching for ZooKeeper, etcd, or Consul, consider whether these simpler approaches might solve your problem with less operational overhead.
```sql
-- Simple leader election using PostgreSQL
-- Table to track leader
CREATE TABLE leader_election (
    resource_name VARCHAR(255) PRIMARY KEY,
    leader_id     VARCHAR(255) NOT NULL,
    acquired_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    expires_at    TIMESTAMPTZ NOT NULL
);

-- Attempt to become leader (or renew)
INSERT INTO leader_election (resource_name, leader_id, expires_at)
VALUES ('my-service', 'instance-123', NOW() + INTERVAL '30 seconds')
ON CONFLICT (resource_name) DO UPDATE
SET leader_id   = EXCLUDED.leader_id,
    acquired_at = NOW(),
    expires_at  = EXCLUDED.expires_at
WHERE leader_election.expires_at < NOW()          -- Only if expired
   OR leader_election.leader_id = 'instance-123'; -- Or we're renewing

-- Check if we're leader (returns row if yes)
SELECT * FROM leader_election
WHERE resource_name = 'my-service'
  AND leader_id = 'instance-123'
  AND expires_at > NOW();

-- Simple lock using advisory locks (PostgreSQL)
SELECT pg_try_advisory_lock(hashtext('my-lock-name'));
-- Returns true if lock acquired, false if held by another session
-- Lock automatically releases when session ends
```

Redis' Redlock algorithm claims to provide distributed locks, but Martin Kleppmann's analysis argued that it can violate mutual exclusion when its timing assumptions do not hold (for example, after a long process pause or a clock jump). For coordination where correctness is critical (financial transactions, data integrity), prefer true consensus-based systems. For coordination where occasional failures are acceptable (duplicate job execution), Redis is often fine.
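To make the lease semantics of the PostgreSQL leader-election query easier to reason about, here is a minimal in-memory sketch of the same rule: a candidate takes the lease only if no row exists, the existing lease has expired, or the candidate already holds it. `LeaseTable`, `Lease`, and the explicit `now` parameter are hypothetical names introduced for illustration; a real deployment would run the SQL and rely on the database's own `NOW()`.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Lease:
    leader_id: str
    expires_at: float


class LeaseTable:
    """In-memory stand-in for the leader_election table (illustrative only)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Lease] = {}

    def try_acquire(self, resource: str, candidate: str, now: float, ttl: float = 30.0) -> bool:
        """Mirror the upsert: take the lease if absent, expired, or already ours."""
        row = self._rows.get(resource)
        if row is None or row.expires_at < now or row.leader_id == candidate:
            self._rows[resource] = Lease(candidate, now + ttl)
            return True
        return False

    def is_leader(self, resource: str, candidate: str, now: float) -> bool:
        """Mirror the SELECT: we lead only while our own lease is unexpired."""
        row = self._rows.get(resource)
        return row is not None and row.leader_id == candidate and row.expires_at > now


table = LeaseTable()
a_wins = table.try_acquire("my-service", "instance-a", now=0.0)         # no row yet
b_blocked = table.try_acquire("my-service", "instance-b", now=10.0)     # lease still valid
a_renews = table.try_acquire("my-service", "instance-a", now=20.0)      # now expires at 50.0
b_takes_over = table.try_acquire("my-service", "instance-b", now=51.0)  # lease has expired
a_still_leader = table.is_leader("my-service", "instance-a", now=52.0)
```

Note what the sketch does not protect against: if `instance-a` pauses after its last successful check, it may keep acting as leader for a moment after `instance-b` has taken over. Whether that window is tolerable is exactly the litmus-test question about two nodes both believing they are the leader.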
```python
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('distributed-locks')


def acquire_lock(lock_name: str, owner_id: str, ttl_seconds: int = 30) -> bool:
    """Attempt to acquire a lock using DynamoDB conditional writes."""
    now = int(time.time())  # Read the clock once so the expiry and the condition agree
    try:
        table.put_item(
            Item={
                'lock_name': lock_name,
                'owner_id': owner_id,
                'expires_at': now + ttl_seconds,
            },
            # Only succeeds if lock doesn't exist OR is expired
            ConditionExpression='attribute_not_exists(lock_name) OR expires_at < :now',
            ExpressionAttributeValues={':now': now},
        )
        return True  # Lock acquired
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False  # Lock held by someone else
        raise


def release_lock(lock_name: str, owner_id: str) -> bool:
    """Release a lock only if we own it."""
    try:
        table.delete_item(
            Key={'lock_name': lock_name},
            ConditionExpression='owner_id = :owner',
            ExpressionAttributeValues={':owner': owner_id},
        )
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False  # We don't own it
        raise  # Throttling, network errors, etc. are not "lock not owned"
```

The Idempotency Mindset
Before adding coordination to prevent duplicates, ask: Can I make the duplicate harmless?
Idempotency often eliminates the need for distributed locks entirely, and an idempotent operation is usually simpler to run than new coordination infrastructure.
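As a rough illustration of making a duplicate harmless, the sketch below deduplicates requests by a caller-supplied idempotency key. `PaymentProcessor`, `charge`, and the receipt format are hypothetical, and the in-memory dictionary stands in for durable storage; in a real system the check-and-record step would need to be atomic, for example via a unique constraint on the key.

```python
from typing import Dict


class PaymentProcessor:
    """Hypothetical processor that deduplicates by idempotency key (not thread-safe)."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}  # idempotency key -> stored receipt
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = f"receipt-{self.charges_made}-{amount_cents}"
        self._results[idempotency_key] = receipt
        return receipt


processor = PaymentProcessor()
first = processor.charge("order-42", 1999)
retry = processor.charge("order-42", 1999)  # e.g. a second worker picked up the same job
```

With this shape, two workers that both process `order-42` produce one charge and the same receipt, so no lock is needed to keep them apart.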
Coordination services are sometimes misused in ways that create problems rather than solve them. Recognize these anti-patterns to avoid common pitfalls.
If 90% of your coordination requests are for the same resource (the same lock, the same leader check), you probably don't need distributed coordination — you need to redesign to eliminate that single point of coordination or cache the result.
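One way to read "cache the result" is a short-lived local cache in front of the repeated leader check. The sketch below is a minimal version under that assumption; `CachedLeaderCheck`, `fetch_leader`, and the explicit `now` parameter are hypothetical names, and `fetch_leader` stands in for whatever call actually reaches the coordination service.

```python
from typing import Callable, List, Optional


class CachedLeaderCheck:
    """Caches the answer of an expensive lookup for `ttl` seconds (illustrative only)."""

    def __init__(self, fetch: Callable[[], str], ttl: float) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._value: Optional[str] = None
        self._fetched_at = float("-inf")

    def get(self, now: float) -> str:
        # Refresh only when nothing is cached or the cached answer is too old.
        if self._value is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


lookups: List[float] = []  # records each call that reaches the "coordination service"


def fetch_leader() -> str:
    lookups.append(1.0)
    return "instance-a"


check = CachedLeaderCheck(fetch_leader, ttl=5.0)
answers = [check.get(now=t) for t in (0.0, 1.0, 2.0, 6.0)]
```

The trade-off is staleness: a cached answer can be wrong for up to `ttl` seconds after leadership changes, so this only suits checks where a briefly outdated answer is acceptable.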
Use this decision tree to determine whether a dedicated coordination service is right for your situation.
| Situation | Recommendation |
|---|---|
| Simple leader election, already on K8s | Kubernetes Lease-based leader election (backed by etcd) |
| Infrequent locking, have PostgreSQL | Database advisory locks |
| High-frequency locking, performance critical | Dedicated coordination service |
| Service discovery, VMs + containers | Consul |
| Configuration management, cloud-native | etcd or K8s ConfigMaps |
| Need service mesh | Consul (or Istio if K8s-only) |
| Running Hadoop or ZooKeeper-mode Kafka | ZooKeeper (already required; newer Kafka releases use KRaft instead) |
| Simple rate limiting | Redis |
| True distributed transactions | Consider whether you really need them, then 2PC or Saga |
Let's examine real scenarios where organizations made (or reconsidered) decisions about coordination services.
Case: E-commerce Company Removes ZooKeeper
Initial State:
Analysis:
Solution:
Result:
Notice how each case has a different right answer. The e-commerce company didn't need coordination at all. The fintech needed it but had the wrong tool. The enterprise needed Consul's specific capabilities. Your situation will be similarly unique.
Every coordination service deployment has costs. Make sure the benefits outweigh them.
The Break-Even Question
A coordination service is justified when:
If you're uncertain, start without a coordination service. Add one when you hit limitations of simpler approaches. This follows the principle of earning complexity through proven need.
You've completed the Coordination Services module. You now understand ZooKeeper's hierarchical model, etcd's Raft-based simplicity, Consul's integrated approach, how to choose between them, and when to use them at all. This knowledge forms the foundation for designing reliable distributed systems that coordinate correctly without unnecessary complexity.
What's Next:
With a deep understanding of distributed coordination and consensus, you're prepared to explore the next major topic in system design: Scalability Fundamentals — understanding how systems grow to handle increasing load while maintaining performance and reliability.