Kubernetes has become the de facto standard for container orchestration, powering workloads from the smallest startups to the largest enterprises and hyperscalers. But beneath its powerful abstractions lies a carefully designed distributed system with distinct components, each serving a critical purpose.
Understanding Kubernetes components isn't just academic knowledge—it's essential for debugging production issues, capacity planning, security hardening, and designing resilient architectures. When your application deployment fails, knowing whether the problem lies with the API Server, Scheduler, Controller Manager, or kubelet fundamentally changes your troubleshooting approach.
By the end of this page, you will have a comprehensive understanding of every core Kubernetes component, how they communicate, their failure modes, and how this architecture enables the self-healing, declarative nature of Kubernetes. You'll be able to reason about cluster behavior at the component level—a skill that separates operators who merely deploy from those who truly understand.
Kubernetes follows a control plane / data plane architecture pattern common in distributed systems. This separation provides clear boundaries between the components that make decisions and the components that execute work.
The Control Plane (historically called the master components) is the brain of the cluster. It maintains the desired state, makes scheduling decisions, responds to cluster events, and exposes the API. The control plane doesn't run your application workloads—it orchestrates them.
The Data Plane (the worker nodes) is where your containers actually run. Each node hosts the components necessary to run Pods and communicate with the control plane. The data plane executes the decisions made by the control plane.
| Component Type | Components | Primary Responsibility |
|---|---|---|
| Control Plane | kube-apiserver, etcd, kube-scheduler, kube-controller-manager, cloud-controller-manager | Cluster state management, scheduling, reconciliation |
| Data Plane | kubelet, kube-proxy, container runtime | Running containers, networking, health monitoring |
Why this separation matters:
This architectural division enables several critical capabilities:
Scalability: Control plane and data plane can scale independently. You can add hundreds of worker nodes without proportionally scaling control plane components.
Isolation: Control plane failures don't immediately kill running workloads. If the API server goes down, existing Pods continue running—you just can't make changes.
Security: The control plane can be isolated in separate network segments, reducing attack surface. Worker nodes only need limited access to specific control plane endpoints.
Maintenance: Control plane components can be upgraded or restarted with minimal impact on running workloads.
In production, control plane components are typically run with high availability—usually three or more replicas spread across availability zones. Managed Kubernetes services (EKS, GKE, AKS) abstract this away, managing the control plane for you while you focus on the data plane.
The kube-apiserver is the central hub of all Kubernetes communication. Every interaction with the cluster—whether from kubectl, controllers, the scheduler, or kubelets—goes through the API server. It's not just a gateway; it's the single source of truth for the cluster's current state.
Core Responsibilities:
API Endpoint Exposure: Serves the Kubernetes REST API over HTTPS, handling CRUD operations for all Kubernetes objects (Pods, Services, ConfigMaps, etc.)
Authentication & Authorization: Validates the identity of all requests (via certificates, tokens, etc.) and enforces RBAC policies to determine what actions are permitted.
Admission Control: Runs admission controllers that can mutate or validate requests before they're persisted. This is where policies like resource quotas, pod security standards, and webhook-based validations are enforced.
etcd Gateway: Serves as the only component that directly communicates with etcd. All cluster state changes go through the API server to etcd.
Watch Mechanism: Supports efficient watches that allow clients to subscribe to changes rather than polling. This is how controllers learn about new or modified resources.
Request Flow Through the API Server:

```
1. Request Arrives
   └─► Authentication
       ├─ Client Certificate
       ├─ Bearer Token
       ├─ OpenID Connect
       └─ Webhook Token Authentication
2. Authorization (after authentication succeeds)
   └─► RBAC Check
       ├─ Role/ClusterRole lookup
       ├─ RoleBinding/ClusterRoleBinding check
       └─ Decision: Allow/Deny
3. Admission Controllers (if authorized)
   ├─► Mutating Admission
   │   ├─ Default values injection
   │   ├─ Sidecar injection (Istio, etc.)
   │   └─ Label/annotation addition
   └─► Validating Admission
       ├─ Resource quota check
       ├─ Pod security standards
       └─ Custom webhook validations
4. Persistence (if all checks pass)
   └─► Write to etcd
       └─► Return success response
5. Notification
   └─► Watch subscribers notified of change
```

Scalability Characteristics:
The API server is designed to be horizontally scalable. In high-availability setups, multiple API server instances run behind a load balancer. Because all state is stored in etcd, any API server instance can handle any request. This stateless design is critical for production deployments.
Key Performance Considerations:
While the API server is horizontally scalable, it remains the central communication hub. If all API server instances become unavailable, you lose the ability to make any cluster changes—though existing workloads continue running. This makes API server availability critical for operational control.
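The sequential pipeline described above can be modeled as a chain of stages that each either transform the request or reject it. The sketch below is a deliberately simplified illustration; the handler names, the toy quota rule, and the dict standing in for etcd are all assumptions for demonstration, not the real API server machinery:

```python
class Forbidden(Exception):
    """Raised when any pipeline stage rejects the request."""
    pass

def authenticate(req):
    # Stand-in for cert/token/OIDC checks.
    if not req.get("token"):
        raise Forbidden("401: no valid credentials")
    return req

def authorize(req):
    # Stand-in for an RBAC Role/RoleBinding lookup.
    if req["verb"] not in req.get("allowed_verbs", []):
        raise Forbidden("403: RBAC denies '%s'" % req["verb"])
    return req

def mutating_admission(req):
    # Mutating webhooks may inject defaults before validation.
    req["object"].setdefault("labels", {})
    return req

def validating_admission(req):
    # Toy quota policy standing in for validating webhooks.
    if req["object"].get("replicas", 0) > 10:
        raise Forbidden("quota exceeded")
    return req

def handle(req, store):
    """Auth -> authz -> mutate -> validate -> persist, in that order."""
    for stage in (authenticate, authorize, mutating_admission, validating_admission):
        req = stage(req)
    store[req["object"]["name"]] = req["object"]  # stand-in for the etcd write
    return req["object"]

etcd = {}
req = {"token": "abc", "verb": "create", "allowed_verbs": ["create"],
       "object": {"name": "web", "replicas": 3}}
print(handle(req, etcd))
```

The ordering matters: mutation happens before validation so that injected defaults are themselves validated, mirroring the real mutating-then-validating admission sequence.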
etcd is a distributed, consistent key-value store that serves as Kubernetes' backing store for all cluster data. Every object you create—every Pod, Service, ConfigMap, Secret—is persisted in etcd. It's not just storage; it's the foundation of Kubernetes' consistency guarantees.
Why etcd?
Kubernetes requires a storage system with very specific properties:
Strong Consistency: When the API server writes a Pod spec, that write must be immediately visible to all readers. etcd provides this through the Raft consensus algorithm.
Watch Support: Controllers need to react to changes. etcd's native watch capability allows efficient event notification without polling.
Distributed & Fault-Tolerant: etcd can tolerate node failures while maintaining consistency. A 3-node cluster survives 1 failure; a 5-node cluster survives 2.
Transactional Operations: Compare-and-swap operations enable safe concurrent updates—critical for controllers competing to update resources.
| Cluster Size | Failure Tolerance | Recommended Use Case |
|---|---|---|
| 1 node | 0 failures | Development/testing only |
| 3 nodes | 1 failure | Small production clusters |
| 5 nodes | 2 failures | Large production clusters requiring higher availability |
| 7 nodes | 3 failures | Rarely needed; additional nodes add consensus overhead |
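The failure tolerances in this table follow from simple majority arithmetic, sketched here in Python (a generic illustration of Raft quorum math, not etcd's actual code):

```python
def quorum(cluster_size: int) -> int:
    """Raft requires a strict majority of members to commit a write."""
    return cluster_size // 2 + 1

def failure_tolerance(cluster_size: int) -> int:
    """How many members can fail while a majority remains reachable."""
    return cluster_size - quorum(cluster_size)

for n in (1, 3, 4, 5, 7):
    print(f"{n}-node cluster: quorum={quorum(n)}, tolerates {failure_tolerance(n)} failure(s)")
```

Note that even sizes buy nothing: a 4-node cluster still tolerates only 1 failure while adding consensus overhead, which is why odd cluster sizes are the standard recommendation.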
Data Organization in etcd:
Kubernetes organizes data in etcd using a hierarchical key structure:
/registry/pods/<namespace>/<pod-name>
/registry/services/<namespace>/<service-name>
/registry/secrets/<namespace>/<secret-name>
/registry/deployments/<namespace>/<deployment-name>
This organization enables efficient prefix-based watches (watch all pods in a namespace) and range queries.
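To get a feel for why this key layout helps, here is a toy model using an ordered in-memory dict standing in for etcd's keyspace (the keys and helper are illustrative only, not how the API server actually queries etcd):

```python
# Toy stand-in for etcd's ordered keyspace: a prefix scan over sorted keys
# emulates how "list all pods in a namespace" becomes a single range read.
store = {
    "/registry/pods/default/web-1": "pod-spec-1",
    "/registry/pods/default/web-2": "pod-spec-2",
    "/registry/pods/kube-system/coredns-abc": "pod-spec-3",
    "/registry/services/default/web": "svc-spec",
}

def list_by_prefix(store, prefix):
    """Range read: every key under a prefix, like `etcdctl get --prefix`."""
    return {k: v for k, v in sorted(store.items()) if k.startswith(prefix)}

print(list_by_prefix(store, "/registry/pods/default/"))
```

Because keys sharing a prefix are adjacent in sorted order, both range reads and prefix-based watches touch one contiguous slice of the keyspace instead of scanning everything.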
The Raft Consensus Algorithm:
etcd uses Raft for distributed consensus. Here's how it works at a high level:
Losing etcd data means losing your entire cluster configuration. Regular etcd snapshots are essential. Many production incidents have been caused by etcd corruption or loss. Use `etcdctl snapshot save` regularly and store backups off-cluster.
Performance Tuning Considerations:
The kube-scheduler watches for newly created Pods that have no assigned node and selects an appropriate node for them to run on. This is a non-trivial problem—the scheduler must consider resource requirements, affinity/anti-affinity rules, taints and tolerations, data locality, and many other factors.
The Scheduling Process:
Scheduling happens in two phases:
1. Filtering Phase (Predicates) The scheduler filters out nodes that cannot run the Pod:
2. Scoring Phase (Priorities) Among feasible nodes, the scheduler scores each to find the optimal placement:
Scheduling Decision Flow:

```
Pod Created (no nodeName) ──► Scheduler Picks Up
                                     │
                                     ▼
                         ┌─────────────────────┐
                         │   FILTERING PHASE   │
                         └─────────────────────┘
                                     │
       ┌─────────────────────────────┼─────────────────────────────┐
       ▼                             ▼                             ▼
 Node Affinity?                Resource Fit?                 Tolerations?
       │                             │                             │
       └─────────────────────────────┼─────────────────────────────┘
                                     ▼
                   Feasible Nodes (pass all filters)
                                     │
                                     ▼
                         ┌─────────────────────┐
                         │    SCORING PHASE    │
                         └─────────────────────┘
                                     │
       ┌─────────────────────────────┼─────────────────────────────┐
       ▼                             ▼                             ▼
 Resource Balance             Spreading Score              Locality Score
 (LeastRequested)           (Pod Anti-Affinity)             (Volume Zone)
       │                             │                             │
       └─────────────────────────────┼─────────────────────────────┘
                                     ▼
                Final Score = Σ (weight × individual_score)
                                     │
                                     ▼
                     Node with Highest Score Wins
                                     │
                                     ▼
                   Bind Pod to Node (write to etcd)
```

Scheduler Extensibility:
The default scheduler can be extended through several mechanisms:
Common Scheduling Failures:
| Failure Reason | Typical Cause | Resolution |
|---|---|---|
| Insufficient cpu/memory | Resource requests exceed available capacity | Add nodes, reduce requests, or use cluster autoscaler |
| No nodes match node selector | Labels mismatch | Correct labels or update selector |
| Taints not tolerated | Node tainted without matching toleration | Add toleration or remove taint |
| Volume zone conflict | Volume in different zone than viable nodes | Create volume in correct zone |
In large clusters (thousands of nodes), scheduler performance becomes critical. The default scheduler uses techniques like node scoring caching and parallel evaluation. For extremely large clusters, consider running multiple schedulers or using scheduling policies to limit the search space.
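The filter-then-score flow can be sketched in a few lines of Python. The node properties, the taint/toleration check, and the least-loaded scoring rule below are simplified stand-ins for the real scheduler framework's plugins, not its actual implementation:

```python
def schedule(pod, nodes):
    # Filtering phase: drop nodes that cannot run the Pod at all.
    feasible = [
        n for n in nodes
        if n["free_cpu"] >= pod["cpu"]
        and n["free_mem"] >= pod["mem"]
        and all(t in pod["tolerations"] for t in n["taints"])
    ]
    if not feasible:
        return None  # Pod stays Pending with a FailedScheduling event

    # Scoring phase: prefer the node with the most headroom left after
    # placement (a rough stand-in for the LeastRequested priority).
    def score(n):
        return (n["free_cpu"] - pod["cpu"]) + (n["free_mem"] - pod["mem"])

    return max(feasible, key=score)["name"]

nodes = [
    {"name": "node-a", "free_cpu": 2,  "free_mem": 4,  "taints": []},
    {"name": "node-b", "free_cpu": 8,  "free_mem": 16, "taints": []},
    {"name": "node-c", "free_cpu": 16, "free_mem": 32, "taints": ["gpu-only"]},
]
pod = {"cpu": 1, "mem": 2, "tolerations": []}
print(schedule(pod, nodes))  # prints "node-b": node-c is filtered out by its taint
```

Note how filtering is a hard gate (node-c has the most capacity but is excluded) while scoring only ranks the survivors; this two-phase split is exactly what keeps constraints and preferences separate.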
The kube-controller-manager runs a collection of controllers that watch the cluster state and work to move the current state toward the desired state. This is the heart of Kubernetes' declarative model—you specify what you want, and controllers make it happen.
The Controller Pattern:
Every controller follows the same pattern:
Key Controllers Bundled in kube-controller-manager:
| Controller | Watches | Manages | Key Behavior |
|---|---|---|---|
| ReplicaSet Controller | ReplicaSets, Pods | Pod count | Creates/deletes Pods to match desired replica count |
| Deployment Controller | Deployments, ReplicaSets | ReplicaSet versions | Manages rolling updates, rollbacks, and revision history |
| Node Controller | Nodes | Node health | Marks nodes as unhealthy, evicts Pods from dead nodes |
| Service Account Controller | Namespaces | ServiceAccounts | Creates default service account in new namespaces |
| Endpoint Controller | Services, Pods | Endpoints | Updates endpoint lists as Pods come and go |
| Job Controller | Jobs, Pods | Job completions | Creates Pods for Jobs, tracks completion/failure |
| Namespace Controller | Namespaces | Namespace deletion | Cleans up all resources when namespace deleted |
| PV/PVC Controller | PersistentVolumes, PersistentVolumeClaims | Volume binding | Matches claims to volumes, handles reclaim |
Reconciliation in Action—ReplicaSet Example:
Consider a ReplicaSet with `replicas: 3`:
This loop runs continuously. If a Pod dies, the controller creates a replacement. If you scale to 5, the controller creates 2 more. The desired state is always maintained.
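The reconciliation logic above can be sketched generically. This is a simplified model of one level-triggered reconcile pass, not the actual ReplicaSet controller code; the pod dicts and action tuples are illustrative assumptions:

```python
def reconcile(desired_replicas, current_pods):
    """One pass of a level-triggered reconcile: compare desired state
    against observed state and return the actions needed to converge."""
    running = [p for p in current_pods if p["phase"] == "Running"]
    diff = desired_replicas - len(running)
    if diff > 0:
        return [("create", None)] * diff           # too few: create the gap
    if diff < 0:
        return [("delete", p["name"]) for p in running[:-diff]]  # too many
    return []                                      # converged: nothing to do

pods = [{"name": "web-1", "phase": "Running"},
        {"name": "web-2", "phase": "Failed"}]
print(reconcile(3, pods))  # two creates: only one Pod is actually Running
print(reconcile(0, pods))  # one delete: scale to zero
```

Because the function looks only at current state, it gives the same answer whether it missed ten events or none, which is the property that makes level-triggered controllers resilient.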
Leader Election:
In HA setups with multiple control plane nodes, only one instance of kube-controller-manager actively runs controllers—the leader. Others are on standby. If the leader fails, another instance acquires the leader lock and takes over. This prevents conflicting actions from multiple controllers.
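Leader election rests on an atomic compare-and-swap over a shared lock record. Here is a toy in-process model of that claim/standby/takeover sequence; real implementations use Kubernetes Lease objects with renew deadlines, and the class and names below are illustrative assumptions:

```python
import threading

class LeaderLock:
    """Toy compare-and-swap lock modeling a leader-election record."""
    def __init__(self):
        self._holder = None
        self._mutex = threading.Lock()  # stands in for the store's atomicity

    def try_acquire(self, candidate):
        with self._mutex:
            if self._holder is None:      # compare: is the record unclaimed?
                self._holder = candidate  # swap: claim it atomically
                return True
            return False                  # someone else leads; stay on standby

    def release(self):
        with self._mutex:
            self._holder = None

lock = LeaderLock()
print(lock.try_acquire("cm-0"))  # True: first instance becomes leader
print(lock.try_acquire("cm-1"))  # False: second instance stays on standby
lock.release()                   # leader fails / lets its lease lapse
print(lock.try_acquire("cm-1"))  # True: standby takes over
```

The atomicity of the compare-and-swap is what guarantees at most one active leader, and therefore that no two controller-manager instances fight over the same resources.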
Controllers don't guarantee instant state convergence. They operate asynchronously in a level-triggered manner—they reconcile based on the current state, not individual events. This makes them resilient to missed events but means there's always some lag between desired state change and actual state convergence.
The cloud-controller-manager (CCM) contains controllers that interact with cloud provider APIs. This component was extracted from kube-controller-manager to allow cloud providers to develop their integration at their own pace without being tied to Kubernetes release cycles.
Controllers in the CCM:
Node Controller (Cloud Portion)
Route Controller
Service Controller
- Provisions cloud load balancers for `type: LoadBalancer` services

Cloud Controller Manager in Action:

```
Service type: LoadBalancer created in API
        │
        ▼
Service Controller (in CCM) watches
        │
        ▼
CCM calls cloud provider API to create LB
(e.g., CreateLoadBalancer on AWS ELB/NLB, GCP GCLB)
        │
        ▼
Cloud provider provisions load balancer, returns IP
        │
        ▼
CCM updates Service.status.loadBalancer.ingress
with external IP/hostname
        │
        ▼
Traffic flows: Internet → Cloud LB → NodePort → Pod
```

Provider Implementations:
Each major cloud provider maintains their own CCM:
Running Without CCM:
On bare-metal or on-premises deployments, you typically don't run a CCM. Without it:
- `type: LoadBalancer` services remain stuck in the `Pending` state (use MetalLB or similar for bare-metal LB)

In managed Kubernetes services (EKS, GKE, AKS), the CCM is pre-configured and managed for you. You simply create a LoadBalancer service, and a real cloud load balancer appears. Understanding CCM matters when troubleshooting cloud integration issues or running self-managed clusters.
The kubelet is the primary node agent—it runs on every worker node and is responsible for ensuring containers are running as specified. It's the component that actually makes Pods come to life.
Core Responsibilities:
Pod Lifecycle Management
Volume Management
Resource Enforcement
Health Monitoring
kubelet Pod Lifecycle:

```
1. Pod Scheduled to Node (nodeName set by scheduler)
        │
        ▼
2. kubelet Sees Pod (watching pods assigned to its node)
        │
        ▼
3. Volume Setup
   • Mount ConfigMaps, Secrets, PVCs
   • Wait for volume attachment (if cloud volumes)
        │
        ▼
4. Image Pull
   • Check local cache
   • Pull from registry if needed (using imagePullSecrets)
        │
        ▼
5. Container Creation
   • Create sandbox (pause container for networking)
   • Create app containers via Container Runtime Interface
   • Apply cgroup limits, security contexts
        │
        ▼
6. Container Start
   • Start containers in order (init containers first)
   • Execute postStart lifecycle hooks
        │
        ▼
7. Probe Execution (continuous)
   • Startup probe → Liveness probe → Readiness probe
   • Update container status based on results
        │
        ▼
8. Status Reporting
   • Update Pod status in API server
   • Report node conditions and capacity
```

Container Runtime Interface (CRI):
kubelet doesn't run containers directly—it delegates to a container runtime via CRI. This abstraction allows swapping runtimes:
| Runtime | Description | Use Case |
|---|---|---|
| containerd | Industry-standard runtime, graduated CNCF project | Default for most distributions |
| CRI-O | Lightweight runtime optimized for Kubernetes | OpenShift, minimalist setups |
| Docker (via cri-dockerd) | Docker Engine with CRI shim | Legacy compatibility |
Pod Eviction:
When node resources are exhausted, kubelet evicts Pods based on priority:
If kubelet fails, the node stops reporting status. After the `node-monitor-grace-period` (default 40s), the node is marked `NotReady`. After the `pod-eviction-timeout` (default 5m), Pods are evicted to other nodes. Monitor kubelet health carefully—a silent kubelet failure can leave you with phantom nodes.
kube-proxy runs on every node and implements the Kubernetes Service concept. When you create a Service, kube-proxy configures the node's networking rules to direct traffic to the appropriate Pods.
What kube-proxy Does:
Operating Modes:
kube-proxy can operate in different modes with significant performance implications:
| Mode | Mechanism | Performance | When to Use |
|---|---|---|---|
| iptables | Linux iptables rules for NAT/forwarding | Good for <1000 services | Default, widely compatible |
| IPVS | Linux IPVS (IP Virtual Server) for load balancing | Better scalability, O(1) lookup | Large clusters with many services |
| userspace (legacy) | Traffic proxied through userspace process | Poor, high latency | Deprecated, avoid |
How Service Traffic Flows (iptables mode):
When a Pod accesses a ClusterIP service:
1. The Pod sends a packet to `10.96.0.1:80` (the service ClusterIP)
2. iptables DNAT rules rewrite the destination to a selected backend Pod IP (e.g., `10.244.1.5:8080`)
3. The rewritten packet is routed to that Pod, whether it runs on this node or another

IPVS Mode Advantages:
IPVS uses hash tables for O(1) rule lookup versus O(n) iptables chain traversal:
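The complexity difference can be illustrated abstractly: a linear scan stands in for iptables chain traversal, and a dict lookup for IPVS's kernel hash table. This is a conceptual sketch of the lookup cost only, not real datapath code:

```python
# iptables mode: each Service adds rules to a chain that packets
# traverse linearly -> O(n) in the number of Services.
rules = [("10.96.0.%d" % i, "backend-%d" % i) for i in range(1000)]

def iptables_lookup(dst):
    for ip, backend in rules:   # linear chain traversal
        if ip == dst:
            return backend
    return None                 # no matching rule

# IPVS mode: services live in a hash table -> O(1) average lookup.
ipvs_table = dict(rules)

def ipvs_lookup(dst):
    return ipvs_table.get(dst)

# Same answer either way; the cost of finding it is what differs.
assert iptables_lookup("10.96.0.999") == ipvs_lookup("10.96.0.999") == "backend-999"
```

With 1,000 services the worst-case iptables path checks 1,000 rules per packet while the hash lookup stays constant, which is why IPVS scales better as service counts grow.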
Some CNI plugins (Cilium, Calico eBPF mode) can replace kube-proxy entirely. They implement service load balancing using eBPF for even better performance and additional features like session affinity and native load balancing without NAT.
The container runtime is the software responsible for actually running containers. While often overlooked, choosing and understanding your container runtime affects security, performance, and compatibility.
The OCI Standard:
The Open Container Initiative (OCI) defines standards for container images and runtimes:
OCI compliance means different runtimes are interchangeable at the low level.
Runtime Layers:
Container runtimes exist in layers:
kubelet
↓ (CRI - Container Runtime Interface)
High-Level Runtime (containerd, CRI-O)
↓ (OCI Runtime Spec)
Low-Level Runtime (runc, crun, kata-runtime)
↓
Linux Kernel (namespaces, cgroups, seccomp)
| Runtime | Type | Key Characteristics | Best For |
|---|---|---|---|
| containerd | High-level | CNCF graduated, Docker-extracted, production-proven | General Kubernetes deployments |
| CRI-O | High-level | Kubernetes-focused, minimal, stable | OpenShift, security-focused deployments |
| runc | Low-level (OCI) | Reference implementation, standard Linux containers | Default low-level runtime |
| crun | Low-level (OCI) | Written in C, faster startup than runc | Performance-sensitive workloads |
| gVisor (runsc) | Low-level (OCI) | User-space kernel for isolation | Untrusted/multi-tenant workloads |
| Kata Containers | Low-level (OCI) | Lightweight VMs for containers | Strong isolation requirements |
containerd Architecture:
containerd (the most common high-level runtime) provides:
Security Runtimes:
For enhanced isolation (multi-tenant clusters, untrusted code), specialized runtimes provide stronger boundaries:
Kubernetes RuntimeClass lets you specify different runtimes per Pod. Run trusted workloads with runc for performance, untrusted workloads with gVisor for security—all in the same cluster.
Now that we understand each component individually, let's trace a complete workflow to see how they work together. We'll follow the lifecycle of a Deployment from creation to running Pods.
Step-by-Step: Creating a Deployment
Complete Workflow: `kubectl apply -f deployment.yaml`

```
 1. kubectl sends Deployment to API Server
    POST /apis/apps/v1/namespaces/default/deployments
         │
         ▼
 2. API Server authenticates, authorizes, runs admission
    webhooks, then persists Deployment to etcd
         │
         ▼
 3. Deployment Controller sees new Deployment (via watch)
    Creates ReplicaSet with pod template
         │
         ▼
 4. ReplicaSet Controller sees new ReplicaSet
    Creates N Pod objects (pods have no nodeName yet)
         │
         ▼
 5. Scheduler sees unscheduled Pods
    Runs filter/score, assigns each Pod to a node
         │
         ▼
 6. kubelet on each assigned node sees Pod
    Pulls images, mounts volumes, creates containers
         │
         ▼
 7. Container Runtime (containerd) runs containers
    OS-level isolation via namespaces, cgroups
         │
         ▼
 8. kubelet updates Pod status → API Server → etcd
    Pod shows as Running
         │
         ▼
 9. Endpoint Controller sees Running Pods
    Updates Endpoints for Services selecting these Pods
         │
         ▼
10. kube-proxy sees updated Endpoints
    Updates iptables/IPVS rules on all nodes
         │
         ▼
RESULT: Traffic to Service ClusterIP reaches Pods 🎉
```

Key Observations:
etcd is touched only by the API server: All other components communicate via API server watches
Controllers are event-driven but level-triggered: They react to changes but always reconcile based on current state, not event history
No component directly calls another: Communication is via shared state in etcd, accessed through the API server
Asynchronous by design: Each controller operates independently; there's no synchronous orchestration
Self-healing emerges from reconciliation: If a Pod dies, ReplicaSet controller notices and creates a replacement—no central coordinator needed
We've explored every core component of the Kubernetes architecture. Let's consolidate the key takeaways:
What's Next:
Now that you understand the components, the next page dives into the core Kubernetes objects—Pods, Deployments, and Services. You'll see how these abstractions build on the component architecture to provide powerful application management capabilities.
You now have a comprehensive understanding of Kubernetes component architecture. This foundation will inform your ability to debug cluster issues, design resilient deployments, and make informed decisions about Kubernetes configurations. Next, we'll explore how Pods, Deployments, and Services work together.