A Kubernetes cluster is a distributed system with a clear separation between the components that make decisions (the control plane) and the components that execute those decisions (the worker nodes). This architectural split is fundamental to Kubernetes' reliability, scalability, and operational model.
In production environments, understanding this separation is critical. When something goes wrong—and it will—knowing whether the issue lies in the control plane or on a worker node determines your troubleshooting path. When planning for high availability, you need to understand which components need redundancy and how they coordinate.
By the end of this page, you will understand the complete architecture of both the control plane and worker nodes. You'll know how to configure high availability, recognize failure scenarios, and make informed decisions about cluster topology for different scales and reliability requirements.
The control plane (historically called the "master") is the brain of the Kubernetes cluster. It exposes the Kubernetes API, tracks the state of all cluster objects, and makes scheduling decisions. A well-designed control plane is highly available and can survive individual component failures.
Control Plane Components:
| Component | Primary Function | Stateful? | HA Strategy |
|---|---|---|---|
| kube-apiserver | API gateway for all cluster communication | No | Multiple replicas behind load balancer |
| etcd | Distributed key-value store for cluster state | Yes | Odd-numbered cluster (3, 5, 7) |
| kube-scheduler | Pod-to-node assignment decisions | No | Leader election (active/standby) |
| kube-controller-manager | Runs reconciliation controllers | No | Leader election (active/standby) |
| cloud-controller-manager | Cloud provider integration | No | Leader election (active/standby) |
Communication Patterns:
The control plane components communicate through specific patterns:
- API Server as Hub: All components communicate through the API server; no direct component-to-component communication
- etcd Access: Only the API server reads from and writes to etcd; this protects data consistency
- Watch-Based Events: Controllers and the scheduler watch the API server for changes rather than polling
- Leader Election: Scheduler and controller-manager use leader election to ensure only one active instance
Control Plane Communication Topology:

```
  etcd-1 ◄────► etcd-2 ◄────► etcd-3
                  │
                  ▼   (only the API server reads from / writes to etcd)
  kube-apiserver cluster:  api-1   api-2   api-3
                  │
                  ▼
            Load Balancer
                  │
                  ▼
  kube-scheduler (leader elect)
  kube-controller-manager (leader elect)
  cloud-controller-manager (leader elect)
                  │
                  ▼
  Worker Nodes (kubelet, kube-proxy)
```

In managed services (EKS, GKE, AKS), the control plane is fully managed by the cloud provider. You don't see or manage these components—they're abstracted behind an API endpoint. Understanding the architecture still helps with debugging and advanced configurations.
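When you can reach the API server, its aggregated health endpoints give a quick read on control plane health. A small sketch; the `tier=control-plane` label applies to kubeadm-style clusters where these components run as static Pods:

```bash
# Aggregated health of the API server you are connected to
kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'

# On kubeadm-style clusters, the control plane components run as static Pods
kubectl get pods -n kube-system -l tier=control-plane -o wide
```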
etcd is arguably the most critical component in a Kubernetes cluster. It's the sole source of truth for all cluster data—losing etcd means losing your entire cluster configuration.
Why etcd for Kubernetes?
Kubernetes chose etcd because it provides:
- Strong consistency through the Raft consensus algorithm
- An efficient watch API for change notifications
- High availability when run as a cluster
- A simple, battle-tested key-value model
How Kubernetes Uses etcd:
Kubernetes Data Structure in etcd:

```
/registry/
├── pods/
│   ├── default/
│   │   ├── nginx-abc123
│   │   └── web-app-xyz789
│   └── kube-system/
│       ├── coredns-abc123
│       └── kube-proxy-xyz789
├── deployments/
│   └── default/
│       └── web-app
├── replicasets/
│   └── default/
│       └── web-app-5d7f8b9c4d
├── services/
│   └── default/
│       └── web-service
├── secrets/
│   └── default/
│       └── db-credentials
├── configmaps/
│   └── default/
│       └── app-config
├── namespaces/
│   ├── default
│   ├── kube-system
│   └── production
└── ...
```

Each key stores the JSON/protobuf-encoded resource. Watches on key prefixes enable efficient change notifications.
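If you have direct access to an etcd member and its client certificates, you can list these keys yourself. A hedged sketch; the endpoint and certificate paths assume a kubeadm-style layout, and values are stored as protobuf, so listing keys is usually the most useful view:

```bash
# List Pod keys under the /registry prefix (keys only; values are protobuf-encoded)
ETCDCTL_API=3 etcdctl get /registry/pods --prefix --keys-only \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```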
etcd Cluster Sizing:
The Raft consensus algorithm requires a majority (quorum) for writes:
| Cluster Size | Quorum | Tolerated Failures | Recommendation |
|---|---|---|---|
| 1 | 1 | 0 | Development only |
| 3 | 2 | 1 | Small production |
| 5 | 3 | 2 | Large production |
| 7 | 4 | 3 | Rarely needed |
Best Practice: Use 3 nodes for most production clusters. 5 nodes for mission-critical systems requiring higher availability. More than 5 adds consensus overhead without significant benefit.
etcd Performance Considerations:
- Disk write latency dominates etcd performance; run etcd on low-latency SSDs.
- Keep network latency between members low, especially when spanning availability zones.
- Keep the database small; compact history and defragment members regularly.
- Stay under the storage quota; exceeding it raises an alarm and blocks writes until resolved.
Back up etcd regularly and test your restore procedure. Use etcdctl snapshot save for backups. Store backups off-cluster in durable storage. Many production outages have been caused by etcd data loss or corruption with no working backup.
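A minimal backup sketch using etcdctl; the endpoint and certificate paths assume a kubeadm-style layout and are illustrative:

```bash
# Take a snapshot of the current etcd state (run against one healthy member)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot before shipping it off-cluster
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-$(date +%F).db -w table
```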
The kube-apiserver is stateless—it doesn't store any data itself, relying entirely on etcd. This makes horizontal scaling straightforward: run multiple instances behind a load balancer.
HA Configuration:
High Availability API Server Setup:

```
               Load Balancer (L4)
            (HAProxy, nginx, cloud)
              Endpoint: 10.0.0.100
                       │
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
kube-apiserver  kube-apiserver  kube-apiserver
   (node-1)        (node-2)        (node-3)
   Port 6443       Port 6443       Port 6443
       │               │               │
       └───────────────┼───────────────┘
                       ▼
                  etcd Cluster
            (etcd-1, etcd-2, etcd-3)
```

Configuration:
- Load balancer health checks the /healthz endpoint
- All API servers connect to all etcd nodes
- Clients (kubectl, kubelets) connect to the load balancer VIP

Load Balancer Configuration:
| Setting | Recommended Value | Reason |
|---|---|---|
| Health Check Path | /healthz or /readyz | Verifies API server is operational |
| Health Check Interval | 5-10 seconds | Balance between detection speed and load |
| Connection Timeout | 10-30 seconds | Long enough for slow operations |
| Session Affinity | None | API server is stateless |
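A quick way to verify what the load balancer sees is to probe each API server's health endpoint directly. A small sketch; the instance IPs are illustrative, and it assumes the cluster allows unauthenticated access to the health endpoints (the default via the built-in system:public-info-viewer role):

```bash
# Probe each API server the same way the load balancer's health check would
for ip in 10.0.0.11 10.0.0.12 10.0.0.13; do
  curl -sk -o /dev/null -w "${ip} /readyz -> %{http_code}\n" "https://${ip}:6443/readyz"
done
```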
Stacked vs. External etcd:
There are two common topologies for HA control planes:
Stacked etcd (co-located): etcd members run on the same machines as the other control plane components. Setup is simpler and requires fewer machines, but losing one node takes out both an etcd member and a control plane instance.
External etcd (dedicated cluster): etcd runs on its own dedicated machines, isolating the datastore from control plane load and allowing independent scaling, at the cost of more machines and more operational complexity.
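For the external topology, kubeadm can be pointed at an existing etcd cluster. A hedged sketch of the relevant ClusterConfiguration fields; the endpoints, VIP, and certificate paths are illustrative:

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "10.0.0.100:6443"   # load balancer VIP
etcd:
  external:
    endpoints:
      - https://10.0.1.11:2379
      - https://10.0.1.12:2379
      - https://10.0.1.13:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```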
Regularly test failover by stopping an API server instance. Verify that kubectl commands continue to work, that kubelets maintain communication, and that the load balancer correctly removes unhealthy instances. Chaos engineering for your control plane prevents surprises.
While multiple API servers can run simultaneously, the kube-scheduler and kube-controller-manager require only one active instance at a time. Running multiple instances would cause conflicting decisions—multiple schedulers might assign the same Pod to different nodes.
How Leader Election Works:
Kubernetes uses a lease-based leader election mechanism:
A Lease object in the kube-system namespace tracks the current leader; the active instance renews it periodically, and standby instances watch it and take over when it expires.

Leader Election Sequence:

```
Time 0s:
  scheduler-1 acquires lease (becomes leader)
  scheduler-2, scheduler-3 in standby (watching lease)

Time 2s:
  scheduler-1 renews lease (still leader)
  scheduler-2, scheduler-3 still in standby

Time 10s: scheduler-1 crashes
  scheduler-1 stops renewing lease
  Lease expires after leaseDurationSeconds (default: 15s)

Time 25s: leader failover
  scheduler-2 sees expired lease, acquires it (becomes leader)
  scheduler-3 sees scheduler-2 as new leader, stays standby
```

Lease Object:

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  holderIdentity: scheduler-2
  leaseDurationSeconds: 15
  renewTime: "2024-01-15T10:30:00Z"
```

Tuning Leader Election Parameters:
| Parameter | Default | Impact |
|---|---|---|
| lease-duration | 15s | Time before lease expires |
| renew-deadline | 10s | How long leader tries to renew before giving up |
| retry-period | 2s | How often to retry acquiring lease |
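These parameters map to flags on the kube-scheduler and kube-controller-manager binaries (on kubeadm clusters, edit the static Pod manifests under /etc/kubernetes/manifests). A sketch showing the defaults written out explicitly:

```bash
kube-scheduler \
  --leader-elect=true \
  --leader-elect-lease-duration=15s \
  --leader-elect-renew-deadline=10s \
  --leader-elect-retry-period=2s
```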
Trade-offs:
- Shorter durations mean faster failover after a leader crash, but more lease-renewal traffic and a higher chance of spurious leadership changes during brief API server or network hiccups.
- Longer durations tolerate transient blips better, but extend the window in which no scheduler or controller-manager is active after a real failure.
Monitoring Leader Status:
```bash
# Check current scheduler leader
kubectl get lease -n kube-system kube-scheduler -o yaml

# Check controller-manager leader
kubectl get lease -n kube-system kube-controller-manager -o yaml
```
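To pull out just the holder identity (handy in scripts or dashboards), a small sketch:

```bash
kubectl get lease kube-scheduler -n kube-system \
  -o jsonpath='{.spec.holderIdentity}{"\n"}'
```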
Network partitions can theoretically cause brief periods where two instances think they're the leader. The lease mechanism, combined with etcd's strong consistency, minimizes this risk—but it's why the system uses defensive checks and idempotent operations where possible.
Worker nodes are where your applications actually run. Each node runs components that receive instructions from the control plane and execute them, managing Pod lifecycle, networking, and storage.
Worker Node Components:
| Component | Function | Runs As |
|---|---|---|
| kubelet | Pod lifecycle management, container execution | System service (not a container) |
| kube-proxy | Network rules for Service routing | DaemonSet or system service |
| Container Runtime | Runs containers (containerd, CRI-O) | System service |
| CNI Plugin | Pod networking (Calico, Cilium, Flannel) | DaemonSet + host configuration |
Worker Node Internal Architecture:

```
Worker Node
│
├─ kubelet
│    • Watches the API server for Pods assigned to this node
│    • Manages Pod lifecycle via the Container Runtime Interface (CRI)
│    • Reports node and Pod status to the API server
│    • Executes probes, manages volumes
│          │ (CRI)
│          ▼
├─ Container Runtime (containerd, CRI-O)
│    • Pulls images from registries
│    • Creates/starts/stops containers
│    • Manages container storage layers
│          │ (OCI)
│          ▼
├─ Low-Level Runtime (runc, crun, kata-runtime)
│    • Sets up namespaces, cgroups, seccomp
│    • Spawns the container process
│
├─ kube-proxy
│    • Watches Services and EndpointSlices
│    • Updates iptables/IPVS rules for Service routing
│
├─ CNI Plugin
│    • Configures Pod networking (IP assignment, routes)
│    • May provide network policies
│
└─ Pods:  [Pod A]   [Pod B]   [Pod C]
```

The kubelet runs as a system service, not a container. It needs direct access to the host system to manage containers, mount volumes, configure networking, and interface with the container runtime. This is a common source of confusion for those familiar with container-based deployments.
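On a worker node itself, you can see this layering directly. A small sketch, assuming crictl is installed and configured for your runtime (it usually is on kubeadm-provisioned nodes):

```bash
# kubelet is a host service, not a Pod
systemctl status kubelet

# Ask the container runtime (via CRI) what it is running
crictl pods                      # Pod sandboxes known to the runtime
crictl ps                        # running containers
crictl inspect <container-id>    # namespaces, cgroups, mounts for one container
```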
Kubernetes continuously monitors node health and takes action when nodes become unavailable. Understanding this lifecycle is crucial for capacity planning and incident response.
Node Registration:
When kubelet starts, it registers the node with the API server:
- The kubelet creates (or updates) a Node object describing the node's capacity, addresses, and labels.
- The node starts with an Unknown condition until the first heartbeat arrives.
- Once its health checks pass, the node is marked Ready and becomes schedulable.

Node Conditions:
| Condition | Meaning | Impact if True |
|---|---|---|
| Ready | Node is healthy and accepting Pods | No impact (desired state) |
| MemoryPressure | Node is running low on memory | New Pods not scheduled, may trigger eviction |
| DiskPressure | Node disk is nearly full | New Pods not scheduled, may trigger eviction |
| PIDPressure | Too many processes on node | New Pods not scheduled |
| NetworkUnavailable | Node network misconfigured | Node not usable until fixed |
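You can read these conditions straight off the Node object; a small sketch (the node name is illustrative):

```bash
kubectl get nodes
kubectl get node worker-1 \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```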
Heartbeat and Failure Detection:
kubelet sends heartbeats to the API server at regular intervals (default: 10 seconds). The control plane uses these to detect node failures:
1. The kubelet updates the node's .status.conditions with a current timestamp (the heartbeat).
2. The node controller waits up to node-monitor-grace-period (default: 40s) for a heartbeat.
3. If none arrives, the node's conditions are set to Unknown.
4. After pod-eviction-timeout (default: 5m), Pods on the node are evicted and rescheduled to other nodes.

Node Failure Sequence:
Node Failure Timeline:

```
T+0s:   Node stops responding (kubelet crash, network failure, etc.)
        └─► Last heartbeat recorded

T+10s:  Missed first heartbeat
        └─► Node still shown as Ready (grace period not exceeded)

T+40s:  node-monitor-grace-period exceeded
        └─► Node marked as condition=Unknown
        └─► Pods on node show status=Unknown

T+5m:   pod-eviction-timeout exceeded
        └─► Pods evicted (deleted) from failed node
        └─► ReplicaSet/Deployment creates replacement Pods
        └─► Scheduler places new Pods on healthy nodes

Note: actual failover time = 40s + 5m ≈ 5.5 minutes by default.
For faster failover, tune node-monitor-grace-period and
pod-eviction-timeout (at the cost of more false positives).
```

Default settings mean a failed node isn't replaced for ~5.5 minutes. For faster failover, you can reduce timeouts, but this risks evicting Pods during transient network issues. Consider Pod Disruption Budgets and proper readiness probes to ensure smooth failover.
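For reference, a hedged sketch of where those timers live (flag placement assumes kubeadm-style static Pod manifests). On current clusters, which use taint-based eviction by default, the five-minute eviction delay effectively comes from the default tolerationSeconds injected by the API server rather than from pod-eviction-timeout:

```bash
# kube-controller-manager: how long to wait for a heartbeat before marking a node Unknown
--node-monitor-grace-period=40s

# kube-apiserver: default tolerationSeconds added to Pods for the
# not-ready/unreachable taints (the effective eviction delay)
--default-not-ready-toleration-seconds=300
--default-unreachable-toleration-seconds=300
```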
Understanding how Kubernetes tracks and allocates node resources is essential for capacity planning and avoiding resource contention.
Capacity vs. Allocatable:
| Term | Meaning |
|---|---|
| Capacity | Total resources on the node (actual hardware) |
| Allocatable | Resources available for user Pods (after system reservations) |
Resource Reservations:
Node Resource Model:

```
Node Capacity (Total)                e.g., 8 CPU, 32Gi memory
├── kube-reserved
│     Reserved for Kubernetes system daemons (kubelet, container runtime)
│     Example: 100m CPU, 1Gi memory
├── system-reserved
│     Reserved for OS system daemons (sshd, systemd, etc.)
│     Example: 100m CPU, 1Gi memory
├── eviction-threshold
│     Buffer before eviction is triggered
│     Example: 100Mi memory, 10% disk
└── Allocatable
      = Capacity - kube-reserved - system-reserved - eviction-threshold
      Available for user Pods
      Example: 7.7 CPU, 29.5Gi memory
```
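These reservations are set in the kubelet configuration. A minimal sketch using KubeletConfiguration fields; the values are illustrative and should be sized for your node type:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:            # held back for Kubernetes daemons
  cpu: "100m"
  memory: "1Gi"
systemReserved:          # held back for OS daemons
  cpu: "100m"
  memory: "1Gi"
evictionHard:            # buffer before eviction triggers
  memory.available: "100Mi"
  nodefs.available: "10%"
```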
Viewing Node Resources:

```bash
# View node capacity and allocatable
kubectl describe node worker-1

Capacity:
  cpu:     8
  memory:  32823280Ki
  pods:    110
Allocatable:
  cpu:     7800m
  memory:  31671280Ki
  pods:    110

# View current allocation
kubectl describe node worker-1 | grep -A 5 "Allocated resources"

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource   Requests      Limits
  cpu        2150m (27%)   4500m (57%)
  memory     3Gi (10%)     8Gi (25%)
```
Extended Resources:
Nodes can advertise custom resources beyond CPU and memory:
- nvidia.com/gpu: GPU devices
- hugepages-2Mi: 2Mi huge pages

Target 60-70% average CPU/memory utilization on nodes. This leaves headroom for bursts, rolling updates, and node failures. If one node fails, remaining nodes must absorb its workload—which is impossible if they're already at 90% utilization.
Taints are applied to nodes and repel Pods unless those Pods have matching tolerations. This mechanism enables specialized node pools and workload isolation.
How Taints Work:
A taint has three components:
- Key: an identifier (e.g., gpu, dedicated)
- Value: an optional value (e.g., true, production)
- Effect: what happens to Pods that don't tolerate the taint (NoSchedule, PreferNoSchedule, NoExecute)

Taint Effects:
| Effect | Behavior | Use Case |
|---|---|---|
| NoSchedule | New Pods won't be scheduled (existing stay) | Prevent specific workloads from landing |
| PreferNoSchedule | Try not to schedule, but not guaranteed | Soft preference for placement |
| NoExecute | Evict existing Pods without toleration | Drain node for maintenance |
```yaml
# Apply taint to a node
# kubectl taint nodes gpu-node-1 gpu=true:NoSchedule

# Node with taint
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
spec:
  taints:
  - key: "gpu"
    value: "true"
    effect: "NoSchedule"
  - key: "dedicated"
    value: "ml-workloads"
    effect: "NoSchedule"

---
# Pod with tolerations (can run on tainted node)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  tolerations:
  # Tolerate the gpu taint
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  # Tolerate the dedicated taint
  - key: "dedicated"
    operator: "Equal"
    value: "ml-workloads"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: ml-training:v1
    resources:
      limits:
        nvidia.com/gpu: 1

---
# Common toleration patterns:

# Tolerate any value for a key
tolerations:
- key: "gpu"
  operator: "Exists"
  effect: "NoSchedule"

# Tolerate all taints (dangerous - used for daemonsets)
tolerations:
- operator: "Exists"
```

Built-in Taints:
Kubernetes automatically applies certain taints:
| Taint | When Applied |
|---|---|
| node.kubernetes.io/not-ready | Node condition is not Ready |
| node.kubernetes.io/unreachable | Node is unreachable |
| node.kubernetes.io/memory-pressure | Memory pressure detected |
| node.kubernetes.io/disk-pressure | Disk pressure detected |
| node.kubernetes.io/unschedulable | Node is cordoned |
Pods automatically get tolerations for these with a tolerationSeconds (default: 300s for unreachable/not-ready), which delays eviction.
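A workload can override that default to fail over faster (or slower). A minimal sketch; 60 seconds is an illustrative value:

```yaml
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60     # evict this Pod 60s after the node becomes unreachable
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60
```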
Taints repel Pods (nodes exclude Pods). Node selectors attract Pods (Pods choose nodes). Use taints to reserve nodes for specific workloads; use node selectors when Pods require specific node characteristics. Often used together for complete control.
Understanding what happens when control plane components fail helps you design resilient clusters and respond effectively to incidents.
Component-Level Failures:
| Failed Component | Immediate Impact | Long-term Impact | Running Pods? |
|---|---|---|---|
| API Server (all) | No kubectl, no new deployments | Controllers can't reconcile | Yes, until kubelet cache expires |
| API Server (partial) | Requests to healthy instances work | Reduced capacity | Yes |
| etcd (minority) | Slower writes, possible timeouts | None if back before quorum lost | Yes |
| etcd (quorum lost) | Cluster is read-only | No mutations possible | Yes |
| Scheduler | New Pods stay Pending | Delayed deployments | Yes, existing continue |
| Controller Manager | No reconciliation | ReplicaSets don't scale, broken Pods not replaced | Yes, but no self-healing |
| kubelet | Node appears dead | Pods eventually evicted to other nodes | Maybe (depends on failure mode) |
Key Insight: Running Pods Keep Running
Kubernetes is designed for graceful degradation:
- The kubelet manages Pods locally and keeps containers running even when it can't reach the API server.
- kube-proxy's existing iptables/IPVS rules keep Service traffic flowing.
- Already-scheduled workloads continue serving requests.
However, without a control plane:
- No new Pods can be scheduled and no deployments can roll out.
- Failed Pods are not replaced (no self-healing).
- Scaling, node failover, and configuration changes all stop.
If you lose all etcd data without backups, you lose your entire cluster state—all deployments, services, secrets, everything. You'll need to recreate everything from source (GitOps) or restore from backup. This is the worst-case scenario. Always maintain tested etcd backups.
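Restoring is the mirror of the backup shown earlier. A hedged sketch for a kubeadm-style, single-member recovery; the paths are illustrative, and the full procedure (stopping the API server, updating the etcd manifest, handling multi-member clusters) depends on how the cluster was built:

```bash
# Materialize the snapshot into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-2024-01-15.db \
  --data-dir=/var/lib/etcd-restored

# Then point the etcd static Pod manifest at /var/lib/etcd-restored and restart etcd
```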
Different deployment scales and reliability requirements call for different cluster topologies.
1. Single-Node Development Cluster:
Single Node (minikube, kind, k3s dev):

```
Single Node (everything on one machine)
  • kube-apiserver
  • etcd
  • kube-scheduler
  • kube-controller-manager
  • kubelet
  • kube-proxy
  • Container Runtime
  • Your Pods

Pros: Simple, minimal resources
Cons: Zero fault tolerance, not for production
```

2. Stacked High-Availability Cluster:
Stacked HA (etcd + control plane on same nodes):

```
           Load Balancer (API access)
                       │
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
 Control Plane   Control Plane   Control Plane
    Node 1          Node 2          Node 3
 (each runs etcd, apiserver, scheduler, controller-manager)
                       │
                       ▼
               Worker Nodes 1...N

Pros: Simpler setup, fewer machines
Cons: Losing a node loses both etcd member and control plane
```

3. External etcd Cluster:
External etcd (dedicated etcd machines):

```
       etcd Cluster (dedicated, isolated)
           etcd-1    etcd-2    etcd-3
                       │
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
 API / Sched /   API / Sched /   API / Sched /
      Ctrl            Ctrl            Ctrl
                       │
                       ▼
                  Worker Nodes

Pros: etcd isolation, independent scaling
Cons: More machines, more complexity
```

For production, spread control plane nodes across availability zones. 3 AZs with one etcd node each ensures zone failure doesn't lose quorum. Consider network latency impact on etcd performance when spanning AZs.
We've covered the complete architecture of Kubernetes control plane and worker nodes. Let's consolidate the key takeaways:
- The control plane (API server, etcd, scheduler, controller-manager) makes decisions; worker nodes (kubelet, kube-proxy, container runtime, CNI) execute them.
- etcd is the single source of truth; run an odd-sized cluster (usually 3 or 5 members) and keep tested backups.
- The API server scales horizontally behind a load balancer; the scheduler and controller-manager rely on leader election.
- Node failures are detected via heartbeats; with default timers, Pods on a failed node take roughly 5.5 minutes to be rescheduled.
- Allocatable = Capacity minus system reservations; taints and tolerations control which Pods can land on which nodes.
- Running Pods survive control plane outages, but scheduling, scaling, and self-healing stop until the control plane recovers.
What's Next:
Now that you understand the infrastructure layer, we'll explore declarative configuration—how Kubernetes uses YAML manifests and the reconciliation model to manage desired state.
You now have a deep understanding of Kubernetes' distributed architecture—control plane components, worker node architecture, failure scenarios, and high availability patterns. This knowledge is essential for designing, operating, and troubleshooting production clusters. Next, we'll dive into declarative configuration.