A Kubernetes cluster is a distributed system with a clear separation between the components that make decisions (the control plane) and the components that execute those decisions (the worker nodes). This architectural split is fundamental to Kubernetes' reliability, scalability, and operational model.
In production environments, understanding this separation is critical. When something goes wrong—and it will—knowing whether the issue lies in the control plane or on a worker node determines your troubleshooting path. When planning for high availability, you need to understand which components need redundancy and how they coordinate.
By the end of this page, you will understand the complete architecture of both the control plane and worker nodes. You'll know how to configure high availability, recognize failure scenarios, and make informed decisions about cluster topology for different scales and reliability requirements.
The control plane (historically called the "master") is the brain of the Kubernetes cluster. It exposes the Kubernetes API, tracks the state of all cluster objects, and makes scheduling decisions. A well-designed control plane is highly available and can survive individual component failures.
Control Plane Components:
| Component | Primary Function | Stateful? | HA Strategy |
|---|---|---|---|
| kube-apiserver | API gateway for all cluster communication | No | Multiple replicas behind load balancer |
| etcd | Distributed key-value store for cluster state | Yes | Odd-numbered cluster (3, 5, 7) |
| kube-scheduler | Pod-to-node assignment decisions | No | Leader election (active/standby) |
| kube-controller-manager | Runs reconciliation controllers | No | Leader election (active/standby) |
| cloud-controller-manager | Cloud provider integration | No | Leader election (active/standby) |
Communication Patterns:
The control plane components communicate through specific patterns:
- API Server as Hub: All components communicate through the API server; no direct component-to-component communication
- etcd Access: Only the API server reads from and writes to etcd; this protects data consistency
- Watch-Based Events: Controllers and the scheduler watch the API server for changes rather than polling
- Leader Election: Scheduler and controller-manager use leader election to ensure only one active instance
Control Plane Communication Topology:

```
  etcd-1 ◄────► etcd-2 ◄────► etcd-3
                  │
                  ▼   (only the API server reads from / writes to etcd)
  kube-apiserver cluster:  api-1   api-2   api-3
                  │
                  ▼
            Load Balancer
                  │
                  ▼
  kube-scheduler (leader elect)
  kube-controller-manager (leader elect)
  cloud-controller-manager (leader elect)
                  │
                  ▼
  Worker Nodes (kubelet, kube-proxy)
```

In managed services (EKS, GKE, AKS), the control plane is fully managed by the cloud provider. You don't see or manage these components—they're abstracted behind an API endpoint. Understanding the architecture still helps with debugging and advanced configurations.
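When you can reach the API server, its aggregated health endpoints give a quick read on control plane health. A small sketch; the `tier=control-plane` label applies to kubeadm-style clusters where these components run as static Pods:

```bash
# Aggregated health of the API server you are connected to
kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'

# On kubeadm-style clusters, the control plane components run as static Pods
kubectl get pods -n kube-system -l tier=control-plane -o wide
```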
etcd is arguably the most critical component in a Kubernetes cluster. It's the sole source of truth for all cluster data—losing etcd means losing your entire cluster configuration.
Why etcd for Kubernetes?
Kubernetes chose etcd because it provides:
- Strong consistency through the Raft consensus algorithm
- An efficient watch API for change notifications
- High availability when run as a cluster
- A simple, battle-tested key-value model
How Kubernetes Uses etcd:
Kubernetes Data Structure in etcd:

```
/registry/
├── pods/
│   ├── default/
│   │   ├── nginx-abc123
│   │   └── web-app-xyz789
│   └── kube-system/
│       ├── coredns-abc123
│       └── kube-proxy-xyz789
├── deployments/
│   └── default/
│       └── web-app
├── replicasets/
│   └── default/
│       └── web-app-5d7f8b9c4d
├── services/
│   └── default/
│       └── web-service
├── secrets/
│   └── default/
│       └── db-credentials
├── configmaps/
│   └── default/
│       └── app-config
├── namespaces/
│   ├── default
│   ├── kube-system
│   └── production
└── ...
```

Each key stores the JSON/protobuf-encoded resource. Watches on key prefixes enable efficient change notifications.
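If you have direct access to an etcd member and its client certificates, you can list these keys yourself. A hedged sketch; the endpoint and certificate paths assume a kubeadm-style layout, and values are stored as protobuf, so listing keys is usually the most useful view:

```bash
# List Pod keys under the /registry prefix (keys only; values are protobuf-encoded)
ETCDCTL_API=3 etcdctl get /registry/pods --prefix --keys-only \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```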
etcd Cluster Sizing:
The Raft consensus algorithm requires a majority (quorum) for writes:
| Cluster Size | Quorum | Tolerated Failures | Recommendation |
|---|---|---|---|
| 1 | 1 | 0 | Development only |
| 3 | 2 | 1 | Small production |
| 5 | 3 | 2 | Large production |
| 7 | 4 | 3 | Rarely needed |
Best Practice: Use 3 nodes for most production clusters. 5 nodes for mission-critical systems requiring higher availability. More than 5 adds consensus overhead without significant benefit.
etcd Performance Considerations:
- Disk write latency dominates etcd performance; run etcd on low-latency SSDs.
- Keep network latency between members low, especially when spanning availability zones.
- Keep the database small; compact history and defragment members regularly.
- Stay under the storage quota; exceeding it raises an alarm and blocks writes until resolved.
Back up etcd regularly and test your restore procedure. Use etcdctl snapshot save for backups. Store backups off-cluster in durable storage. Many production outages have been caused by etcd data loss or corruption with no working backup.
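A minimal backup sketch using etcdctl; the endpoint and certificate paths assume a kubeadm-style layout and are illustrative:

```bash
# Take a snapshot of the current etcd state (run against one healthy member)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot before shipping it off-cluster
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-$(date +%F).db -w table
```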
The kube-apiserver is stateless—it doesn't store any data itself, relying entirely on etcd. This makes horizontal scaling straightforward: run multiple instances behind a load balancer.
HA Configuration:
High Availability API Server Setup:

```
               Load Balancer (L4)
            (HAProxy, nginx, cloud)
              Endpoint: 10.0.0.100
                       │
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
kube-apiserver  kube-apiserver  kube-apiserver
   (node-1)        (node-2)        (node-3)
   Port 6443       Port 6443       Port 6443
       │               │               │
       └───────────────┼───────────────┘
                       ▼
                  etcd Cluster
            (etcd-1, etcd-2, etcd-3)
```

Configuration:
- Load balancer health checks the /healthz endpoint
- All API servers connect to all etcd nodes
- Clients (kubectl, kubelets) connect to the load balancer VIP

Load Balancer Configuration:
| Setting | Recommended Value | Reason |
|---|---|---|
| Health Check Path | /healthz or /readyz | Verifies API server is operational |
| Health Check Interval | 5-10 seconds | Balance between detection speed and load |
| Connection Timeout | 10-30 seconds | Long enough for slow operations |
| Session Affinity | None | API server is stateless |
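A quick way to verify what the load balancer sees is to probe each API server's health endpoint directly. A small sketch; the instance IPs are illustrative, and it assumes the cluster allows unauthenticated access to the health endpoints (the default via the built-in system:public-info-viewer role):

```bash
# Probe each API server the same way the load balancer's health check would
for ip in 10.0.0.11 10.0.0.12 10.0.0.13; do
  curl -sk -o /dev/null -w "${ip} /readyz -> %{http_code}\n" "https://${ip}:6443/readyz"
done
```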
Stacked vs. External etcd:
There are two common topologies for HA control planes:
Stacked etcd (co-located): etcd members run on the same machines as the other control plane components. Setup is simpler and requires fewer machines, but losing one node takes out both an etcd member and a control plane instance.
External etcd (dedicated cluster): etcd runs on its own dedicated machines, isolating the datastore from control plane load and allowing independent scaling, at the cost of more machines and more operational complexity.
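For the external topology, kubeadm can be pointed at an existing etcd cluster. A hedged sketch of the relevant ClusterConfiguration fields; the endpoints, VIP, and certificate paths are illustrative:

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "10.0.0.100:6443"   # load balancer VIP
etcd:
  external:
    endpoints:
      - https://10.0.1.11:2379
      - https://10.0.1.12:2379
      - https://10.0.1.13:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```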
Regularly test failover by stopping an API server instance. Verify that kubectl commands continue to work, that kubelets maintain communication, and that the load balancer correctly removes unhealthy instances. Chaos engineering for your control plane prevents surprises.
While multiple API servers can run simultaneously, the kube-scheduler and kube-controller-manager require only one active instance at a time. Running multiple instances would cause conflicting decisions—multiple schedulers might assign the same Pod to different nodes.
How Leader Election Works:
Kubernetes uses a lease-based leader election mechanism:
A Lease object in the kube-system namespace tracks the current leader; the active instance renews it periodically, and standby instances watch it and take over when it expires.

Leader Election Sequence:

```
Time 0s:
  scheduler-1 acquires lease (becomes leader)
  scheduler-2, scheduler-3 in standby (watching lease)

Time 2s:
  scheduler-1 renews lease (still leader)
  scheduler-2, scheduler-3 still in standby

Time 10s: scheduler-1 crashes
  scheduler-1 stops renewing lease
  Lease expires after leaseDurationSeconds (default: 15s)

Time 25s: leader failover
  scheduler-2 sees expired lease, acquires it (becomes leader)
  scheduler-3 sees scheduler-2 as new leader, stays standby
```

Lease Object:

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  holderIdentity: scheduler-2
  leaseDurationSeconds: 15
  renewTime: "2024-01-15T10:30:00Z"
```

Tuning Leader Election Parameters:
| Parameter | Default | Impact |
|---|---|---|
| lease-duration | 15s | Time before lease expires |
| renew-deadline | 10s | How long leader tries to renew before giving up |
| retry-period | 2s | How often to retry acquiring lease |
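These parameters map to flags on the kube-scheduler and kube-controller-manager binaries (on kubeadm clusters, edit the static Pod manifests under /etc/kubernetes/manifests). A sketch showing the defaults written out explicitly:

```bash
kube-scheduler \
  --leader-elect=true \
  --leader-elect-lease-duration=15s \
  --leader-elect-renew-deadline=10s \
  --leader-elect-retry-period=2s
```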
Trade-offs:
- Shorter durations mean faster failover after a leader crash, but more lease-renewal traffic and a higher chance of spurious leadership changes during brief API server or network hiccups.
- Longer durations tolerate transient blips better, but extend the window in which no scheduler or controller-manager is active after a real failure.
Monitoring Leader Status:
```bash
# Check current scheduler leader
kubectl get lease -n kube-system kube-scheduler -o yaml

# Check controller-manager leader
kubectl get lease -n kube-system kube-controller-manager -o yaml
```
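To pull out just the holder identity (handy in scripts or dashboards), a small sketch:

```bash
kubectl get lease kube-scheduler -n kube-system \
  -o jsonpath='{.spec.holderIdentity}{"\n"}'
```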
Network partitions can theoretically cause brief periods where two instances think they're the leader. The lease mechanism, combined with etcd's strong consistency, minimizes this risk—but it's why the system uses defensive checks and idempotent operations where possible.
Worker nodes are where your applications actually run. Each node runs components that receive instructions from the control plane and execute them, managing Pod lifecycle, networking, and storage.
Worker Node Components:
| Component | Function | Runs As |
|---|---|---|
| kubelet | Pod lifecycle management, container execution | System service (not a container) |
| kube-proxy | Network rules for Service routing | DaemonSet or system service |
| Container Runtime | Runs containers (containerd, CRI-O) | System service |
| CNI Plugin | Pod networking (Calico, Cilium, Flannel) | DaemonSet + host configuration |
Worker Node Internal Architecture:

```
Worker Node
│
├─ kubelet
│    • Watches the API server for Pods assigned to this node
│    • Manages Pod lifecycle via the Container Runtime Interface (CRI)
│    • Reports node and Pod status to the API server
│    • Executes probes, manages volumes
│          │ (CRI)
│          ▼
├─ Container Runtime (containerd, CRI-O)
│    • Pulls images from registries
│    • Creates/starts/stops containers
│    • Manages container storage layers
│          │ (OCI)
│          ▼
├─ Low-Level Runtime (runc, crun, kata-runtime)
│    • Sets up namespaces, cgroups, seccomp
│    • Spawns the container process
│
├─ kube-proxy
│    • Watches Services and EndpointSlices
│    • Updates iptables/IPVS rules for Service routing
│
├─ CNI Plugin
│    • Configures Pod networking (IP assignment, routes)
│    • May provide network policies
│
└─ Pods:  [Pod A]   [Pod B]   [Pod C]
```

The kubelet runs as a system service, not a container. It needs direct access to the host system to manage containers, mount volumes, configure networking, and interface with the container runtime. This is a common source of confusion for those familiar with container-based deployments.
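On a worker node itself, you can see this layering directly. A small sketch, assuming crictl is installed and configured for your runtime (it usually is on kubeadm-provisioned nodes):

```bash
# kubelet is a host service, not a Pod
systemctl status kubelet

# Ask the container runtime (via CRI) what it is running
crictl pods                      # Pod sandboxes known to the runtime
crictl ps                        # running containers
crictl inspect <container-id>    # namespaces, cgroups, mounts for one container
```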
Kubernetes continuously monitors node health and takes action when nodes become unavailable. Understanding this lifecycle is crucial for capacity planning and incident response.
Node Registration:
When kubelet starts, it registers the node with the API server:
- The kubelet creates (or updates) a Node object describing the node's capacity, addresses, and labels.
- The node starts with an Unknown condition until the first heartbeat arrives.
- Once its health checks pass, the node is marked Ready and becomes schedulable.

Node Conditions:
| Condition | Meaning | Impact if True |
|---|---|---|
| Ready | Node is healthy and accepting Pods | No impact (desired state) |
| MemoryPressure | Node is running low on memory | New Pods not scheduled, may trigger eviction |
| DiskPressure | Node disk is nearly full | New Pods not scheduled, may trigger eviction |
| PIDPressure | Too many processes on node | New Pods not scheduled |
| NetworkUnavailable | Node network misconfigured | Node not usable until fixed |
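You can read these conditions straight off the Node object; a small sketch (the node name is illustrative):

```bash
kubectl get nodes
kubectl get node worker-1 \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```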
Heartbeat and Failure Detection:
kubelet sends heartbeats to the API server at regular intervals (default: 10 seconds). The control plane uses these to detect node failures:
1. The kubelet updates the node's .status.conditions with a current timestamp (the heartbeat).
2. The node controller waits up to node-monitor-grace-period (default: 40s) for a heartbeat.
3. If none arrives, the node's conditions are set to Unknown.
4. After pod-eviction-timeout (default: 5m), Pods on the node are evicted and rescheduled to other nodes.

Node Failure Sequence:
Node Failure Timeline:

```
T+0s:   Node stops responding (kubelet crash, network failure, etc.)
        └─► Last heartbeat recorded

T+10s:  Missed first heartbeat
        └─► Node still shown as Ready (grace period not exceeded)

T+40s:  node-monitor-grace-period exceeded
        └─► Node marked as condition=Unknown
        └─► Pods on node show status=Unknown

T+5m:   pod-eviction-timeout exceeded
        └─► Pods evicted (deleted) from failed node
        └─► ReplicaSet/Deployment creates replacement Pods
        └─► Scheduler places new Pods on healthy nodes

Note: actual failover time = 40s + 5m ≈ 5.5 minutes by default.
For faster failover, tune node-monitor-grace-period and
pod-eviction-timeout (at the cost of more false positives).
```

Default settings mean a failed node isn't replaced for ~5.5 minutes. For faster failover, you can reduce timeouts, but this risks evicting Pods during transient network issues. Consider Pod Disruption Budgets and proper readiness probes to ensure smooth failover.
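For reference, a hedged sketch of where those timers live (flag placement assumes kubeadm-style static Pod manifests). On current clusters, which use taint-based eviction by default, the five-minute eviction delay effectively comes from the default tolerationSeconds injected by the API server rather than from pod-eviction-timeout:

```bash
# kube-controller-manager: how long to wait for a heartbeat before marking a node Unknown
--node-monitor-grace-period=40s

# kube-apiserver: default tolerationSeconds added to Pods for the
# not-ready/unreachable taints (the effective eviction delay)
--default-not-ready-toleration-seconds=300
--default-unreachable-toleration-seconds=300
```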
Understanding how Kubernetes tracks and allocates node resources is essential for capacity planning and avoiding resource contention.
Capacity vs. Allocatable:
| Term | Meaning |
|---|---|
| Capacity | Total resources on the node (actual hardware) |
| Allocatable | Resources available for user Pods (after system reservations) |
Resource Reservations:
Node Resource Model:

```
Node Capacity (Total)                e.g., 8 CPU, 32Gi memory
├── kube-reserved
│     Reserved for Kubernetes system daemons (kubelet, container runtime)
│     Example: 100m CPU, 1Gi memory
├── system-reserved
│     Reserved for OS system daemons (sshd, systemd, etc.)
│     Example: 100m CPU, 1Gi memory
├── eviction-threshold
│     Buffer before eviction is triggered
│     Example: 100Mi memory, 10% disk
└── Allocatable
      = Capacity - kube-reserved - system-reserved - eviction-threshold
      Available for user Pods
      Example: 7.7 CPU, 29.5Gi memory
```
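These reservations are set in the kubelet configuration. A minimal sketch using KubeletConfiguration fields; the values are illustrative and should be sized for your node type:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:            # held back for Kubernetes daemons
  cpu: "100m"
  memory: "1Gi"
systemReserved:          # held back for OS daemons
  cpu: "100m"
  memory: "1Gi"
evictionHard:            # buffer before eviction triggers
  memory.available: "100Mi"
  nodefs.available: "10%"
```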
Viewing Node Resources:

```bash
# View node capacity and allocatable
kubectl describe node worker-1

Capacity:
  cpu:     8
  memory:  32823280Ki
  pods:    110
Allocatable:
  cpu:     7800m
  memory:  31671280Ki
  pods:    110

# View current allocation
kubectl describe node worker-1 | grep -A 5 "Allocated resources"

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource   Requests      Limits
  cpu        2150m (27%)   4500m (57%)
  memory     3Gi (10%)     8Gi (25%)
```
Extended Resources:
Nodes can advertise custom resources beyond CPU and memory:
- nvidia.com/gpu: GPU devices
- hugepages-2Mi: 2Mi huge pages

Target 60-70% average CPU/memory utilization on nodes. This leaves headroom for bursts, rolling updates, and node failures. If one node fails, remaining nodes must absorb its workload—which is impossible if they're already at 90% utilization.
Taints are applied to nodes and repel Pods unless those Pods have matching tolerations. This mechanism enables specialized node pools and workload isolation.
How Taints Work:
A taint has three components:
- Key: an identifier (e.g., gpu, dedicated)
- Value: an optional value (e.g., true, production)
- Effect: what happens to Pods that don't tolerate the taint (NoSchedule, PreferNoSchedule, NoExecute)

Taint Effects:
| Effect | Behavior | Use Case |
|---|---|---|
| NoSchedule | New Pods won't be scheduled (existing stay) | Prevent specific workloads from landing |
| PreferNoSchedule | Try not to schedule, but not guaranteed | Soft preference for placement |
| NoExecute | Evict existing Pods without toleration | Drain node for maintenance |
```yaml
# Apply taint to a node
# kubectl taint nodes gpu-node-1 gpu=true:NoSchedule

# Node with taint
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
spec:
  taints:
  - key: "gpu"
    value: "true"
    effect: "NoSchedule"
  - key: "dedicated"
    value: "ml-workloads"
    effect: "NoSchedule"

---
# Pod with tolerations (can run on tainted node)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  tolerations:
  # Tolerate the gpu taint
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  # Tolerate the dedicated taint
  - key: "dedicated"
    operator: "Equal"
    value: "ml-workloads"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: ml-training:v1
    resources:
      limits:
        nvidia.com/gpu: 1

---
# Common toleration patterns:

# Tolerate any value for a key
tolerations:
- key: "gpu"
  operator: "Exists"
  effect: "NoSchedule"

# Tolerate all taints (dangerous - used for daemonsets)
tolerations:
- operator: "Exists"
```

Built-in Taints:
Kubernetes automatically applies certain taints:
| Taint | When Applied |
|---|---|
| node.kubernetes.io/not-ready | Node condition is not Ready |
| node.kubernetes.io/unreachable | Node is unreachable |
| node.kubernetes.io/memory-pressure | Memory pressure detected |
| node.kubernetes.io/disk-pressure | Disk pressure detected |
| node.kubernetes.io/unschedulable | Node is cordoned |
Pods automatically get tolerations for these with a tolerationSeconds (default: 300s for unreachable/not-ready), which delays eviction.
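A workload can override that default to fail over faster (or slower). A minimal sketch; 60 seconds is an illustrative value:

```yaml
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60     # evict this Pod 60s after the node becomes unreachable
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60
```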
Taints repel Pods (nodes exclude Pods). Node selectors attract Pods (Pods choose nodes). Use taints to reserve nodes for specific workloads; use node selectors when Pods require specific node characteristics. Often used together for complete control.
Understanding what happens when control plane components fail helps you design resilient clusters and respond effectively to incidents.
Component-Level Failures:
| Failed Component | Immediate Impact | Long-term Impact | Running Pods? |
|---|---|---|---|
| API Server (all) | No kubectl, no new deployments | Controllers can't reconcile | Yes, until kubelet cache expires |
| API Server (partial) | Requests to healthy instances work | Reduced capacity | Yes |
| etcd (minority) | Slower writes, possible timeouts | None if back before quorum lost | Yes |
| etcd (quorum lost) | Cluster is read-only | No mutations possible | Yes |
| Scheduler | New Pods stay Pending | Delayed deployments | Yes, existing continue |
| Controller Manager | No reconciliation | ReplicaSets don't scale, broken Pods not replaced | Yes, but no self-healing |
| kubelet | Node appears dead | Pods eventually evicted to other nodes | Maybe (depends on failure mode) |
Key Insight: Running Pods Keep Running
Kubernetes is designed for graceful degradation:
- The kubelet manages Pods locally and keeps containers running even when it can't reach the API server.
- kube-proxy's existing iptables/IPVS rules keep Service traffic flowing.
- Already-scheduled workloads continue serving requests.
However, without a control plane:
- No new Pods can be scheduled and no deployments can roll out.
- Failed Pods are not replaced (no self-healing).
- Scaling, node failover, and configuration changes all stop.
If you lose all etcd data without backups, you lose your entire cluster state—all deployments, services, secrets, everything. You'll need to recreate everything from source (GitOps) or restore from backup. This is the worst-case scenario. Always maintain tested etcd backups.
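Restoring is the mirror of the backup shown earlier. A hedged sketch for a kubeadm-style, single-member recovery; the paths are illustrative, and the full procedure (stopping the API server, updating the etcd manifest, handling multi-member clusters) depends on how the cluster was built:

```bash
# Materialize the snapshot into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-2024-01-15.db \
  --data-dir=/var/lib/etcd-restored

# Then point the etcd static Pod manifest at /var/lib/etcd-restored and restart etcd
```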
Different deployment scales and reliability requirements call for different cluster topologies.
1. Single-Node Development Cluster:
Single Node (minikube, kind, k3s dev):

```
Single Node (everything on one machine)
  • kube-apiserver
  • etcd
  • kube-scheduler
  • kube-controller-manager
  • kubelet
  • kube-proxy
  • Container Runtime
  • Your Pods

Pros: Simple, minimal resources
Cons: Zero fault tolerance, not for production
```

2. Stacked High-Availability Cluster:
Stacked HA (etcd + control plane on same nodes):

```
           Load Balancer (API access)
                       │
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
 Control Plane   Control Plane   Control Plane
    Node 1          Node 2          Node 3
 (each runs etcd, apiserver, scheduler, controller-manager)
                       │
                       ▼
               Worker Nodes 1...N

Pros: Simpler setup, fewer machines
Cons: Losing a node loses both etcd member and control plane
```

3. External etcd Cluster:
External etcd (dedicated etcd machines):

```
       etcd Cluster (dedicated, isolated)
           etcd-1    etcd-2    etcd-3
                       │
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
 API / Sched /   API / Sched /   API / Sched /
      Ctrl            Ctrl            Ctrl
                       │
                       ▼
                  Worker Nodes

Pros: etcd isolation, independent scaling
Cons: More machines, more complexity
```

For production, spread control plane nodes across availability zones. 3 AZs with one etcd node each ensures zone failure doesn't lose quorum. Consider network latency impact on etcd performance when spanning AZs.
We've covered the complete architecture of Kubernetes control plane and worker nodes. Let's consolidate the key takeaways:
- The control plane (API server, etcd, scheduler, controller-manager) makes decisions; worker nodes (kubelet, kube-proxy, container runtime, CNI) execute them.
- etcd is the single source of truth; run an odd-sized cluster (usually 3 or 5 members) and keep tested backups.
- The API server scales horizontally behind a load balancer; the scheduler and controller-manager rely on leader election.
- Node failures are detected via heartbeats; with default timers, Pods on a failed node take roughly 5.5 minutes to be rescheduled.
- Allocatable = Capacity minus system reservations; taints and tolerations control which Pods can land on which nodes.
- Running Pods survive control plane outages, but scheduling, scaling, and self-healing stop until the control plane recovers.
What's Next:
Now that you understand the infrastructure layer, we'll explore declarative configuration—how Kubernetes uses YAML manifests and the reconciliation model to manage desired state.
You now have a deep understanding of Kubernetes' distributed architecture—control plane components, worker node architecture, failure scenarios, and high availability patterns. This knowledge is essential for designing, operating, and troubleshooting production clusters. Next, we'll dive into declarative configuration.