Running containers at scale presents challenges remarkably similar to those facing operating systems: scheduling workloads onto available resources, managing process lifecycle, handling failures, providing networking and storage abstractions, and ensuring security isolation. Container orchestration systems address these challenges for distributed containerized applications.
Kubernetes has emerged as the dominant container orchestration platform, serving as the "operating system" for cloud-native applications. Understanding Kubernetes is essential for anyone building or operating modern cloud systems.
By completing this page, you will understand Kubernetes architecture and its operating system analogies, the scheduling algorithms that place containers on nodes, networking and storage abstractions, and production deployment patterns for resilient applications.
Kubernetes follows a declarative, controller-based architecture. Users specify desired state, and controllers continuously work to achieve and maintain that state. This model mirrors how operating systems manage resources through kernel subsystems.
The Kubernetes Architecture:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES CLUSTER │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ CONTROL PLANE │ │
│ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ API SERVER (kube-apiserver) │ │ │
│ │ │ - RESTful API for all cluster operations │ │ │
│ │ │ - Authentication, Authorization, Admission Control │ │ │
│ │ │ - Validation and persistence to etcd │ │ │
│ │ └─────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌─────────────────────┼─────────────────────┐ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌───────────┐ ┌─────────────────┐ ┌──────────────────┐ │ │
│ │ │ etcd │ │ Scheduler │ │Controller Manager│ │ │
│ │ │ │ │(kube-scheduler) │ │ │ │ │
│ │ │ Key-Value │ │ │ │ - Node Controller│ │ │
│ │ │ Store │ │ - Pod Placement │ │ - Replication │ │ │
│ │ │ │ │ - Resource Fit │ │ - Service/Endpoints│ │ │
│ │ │ Cluster │ │ - Affinity/Anti │ │ - Namespace │ │ │
│ │ │ State │ │ │ │ │ │ │
│ │ └───────────┘ └─────────────────┘ └──────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────────────┴───────────────────────────────────┐ │
│ │ WORKER NODES │ │
│ │ ┌─────────────────────┐ ┌─────────────────────┐ ┌────────────────┐ │ │
│ │ │ NODE 1 │ │ NODE 2 │ │ NODE N │ │ │
│ │ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │ ┌────────────┐ │ │ │
│ │ │ │ kubelet │ │ │ │ kubelet │ │ │ │ kubelet │ │ │ │
│ │ │ └─────────────────┘ │ │ └─────────────────┘ │ │ └────────────┘ │ │ │
│ │ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │ ┌────────────┐ │ │ │
│ │ │ │ kube-proxy │ │ │ │ kube-proxy │ │ │ │ kube-proxy │ │ │ │
│ │ │ └─────────────────┘ │ │ └─────────────────┘ │ │ └────────────┘ │ │ │
│ │ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │ ┌────────────┐ │ │ │
│ │ │ │Container Runtime│ │ │ │Container Runtime│ │ │ │Container │ │ │ │
│ │ │ │(containerd/CRI-O)│ │ │ │(containerd/CRI-O)│ │ │ │Runtime │ │ │ │
│ │ │ └─────────────────┘ │ │ └─────────────────┘ │ │ └────────────┘ │ │ │
│ │ │ ┌──┐ ┌──┐ ┌──┐ │ │ ┌──┐ ┌──┐ ┌──┐ │ │ ┌──┐ ┌──┐ │ │ │
│ │ │ │P1│ │P2│ │P3│ │ │ │P4│ │P5│ │P6│ │ │ │P7│ │P8│ │ │ │
│ │ │ └──┘ └──┘ └──┘ │ │ └──┘ └──┘ └──┘ │ │ └──┘ └──┘ │ │ │
│ │ └─────────────────────┘ └─────────────────────┘ └────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
| Kubernetes Component | Operating System Analogy | Primary Responsibility |
|---|---|---|
| API Server | System call interface | All cluster interactions go through API Server |
| etcd | Registry / Configuration database | Persistent storage of cluster state |
| Scheduler | Process scheduler (CPU scheduler) | Assigns pods to nodes based on resources |
| Controller Manager | Kernel subsystems | Ensures desired state equals actual state |
| kubelet | init/systemd process manager | Manages pod lifecycle on each node |
| kube-proxy | Network stack / iptables | Implements service networking rules |
| Container Runtime | Process execution (exec syscall) | Actually runs containers (containerd, CRI-O) |
Unlike imperative systems where you issue commands ('start container X'), Kubernetes is declarative: you specify 'I want 3 replicas of container X running', and controllers work continuously to make reality match desire. This enables self-healing—if a container dies, controllers automatically create a replacement.
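For example, a minimal Deployment manifest sketch (the name and image are illustrative) declares the desired state; the Deployment controller then creates or replaces pods until three replicas are running:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # illustrative name
spec:
  replicas: 3                    # desired state: three running copies
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25        # illustrative image
        ports:
        - containerPort: 80

If a pod backing this Deployment is deleted or its node fails, the controller notices the gap between desired and actual state and creates a replacement automatically.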
Kubernetes represents all resources as API objects stored in etcd. Understanding these objects is essential for working with Kubernetes.
Pods: The Atomic Unit
A Pod is one or more containers that share a network namespace (a single IP address, so containers reach each other over localhost), shared volumes, and a common lifecycle (they are scheduled, started, and stopped together):
┌─────────────────────────────────────────────────────────────┐
│ POD │
│ (IP: 10.244.1.5) │
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Main Container │ │ Sidecar Container │ │
│ │ │ │ │ │
│ │ (Application) │ │ (Logging Agent) │ │
│ │ │ │ │ │
│ │ Port 8080 ─────────┼──┼─► Port 9090 │ │
│ │ │ │ │ │
│ │ localhost:9090 │◄─┼─── │ │
│ │ (can reach sidecar)│ │ │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │ │
│ ┌─────────────────┴─────────────────┐ │
│ │ Shared Volume │ │
│ │ (EmptyDir, PVC, ConfigMap) │ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Why Pods, Not Just Containers? Many workloads consist of a main application plus tightly coupled helpers such as logging agents, proxies, or data loaders. A pod lets these run as separate container images while being scheduled as a single unit, sharing localhost networking and volumes.
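A minimal sketch of this sidecar pattern, mirroring the diagram above (image names are illustrative placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar             # illustrative name
spec:
  containers:
  - name: app                        # main application container
    image: example/app:1.0           # illustrative image
    ports:
    - containerPort: 8080
    volumeMounts:
    - name: shared-logs
      mountPath: /var/log/app
  - name: log-agent                  # sidecar: ships logs the app writes to the shared volume
    image: example/log-agent:1.0     # illustrative image
    ports:
    - containerPort: 9090
    volumeMounts:
    - name: shared-logs
      mountPath: /var/log/app
      readOnly: true
  volumes:
  - name: shared-logs
    emptyDir: {}                     # scratch volume shared by both containers, lives as long as the pod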
Workload Controllers:
Controllers manage pod lifecycle and ensure desired state:
Deployment: Manages stateless, replicated applications. It creates ReplicaSets, performs rolling updates and rollbacks, and keeps the specified number of identical pods running.
StatefulSet: Manages stateful applications that need stable network identities (pod-0, pod-1, ...), ordered startup and shutdown, and per-replica persistent storage.
DaemonSet: Runs one copy of a pod on every node (or every node matching a selector); commonly used for log collectors, monitoring agents, and networking components.
Job and CronJob: A Job runs pods to completion with retries on failure (batch work); a CronJob creates Jobs on a schedule.
Services: A Service provides a stable virtual IP (ClusterIP) and DNS name in front of a changing set of pods selected by labels. Pods access services via DNS: service-name.namespace.svc.cluster.local. This decouples consumers from provider pod IPs. When pods die and are recreated (potentially on different nodes with different IPs), the service DNS name continues to work, pointing to the new pod IPs via endpoints.
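A minimal ClusterIP Service sketch fronting the pods from the Deployment example above (names are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: default
spec:
  selector:
    app: web                # endpoints = all ready pods carrying this label
  ports:
  - protocol: TCP
    port: 80                # port exposed on the ClusterIP
    targetPort: 80          # container port on the selected pods

Pods in the same namespace can reach it simply as my-service; from elsewhere in the cluster, as my-service.default.svc.cluster.local.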
The Kubernetes scheduler is responsible for assigning unscheduled pods to nodes. This is the cloud-scale analog of a process scheduler, with similar concerns: resource matching, fairness, and optimization.
Scheduling Algorithm:
The scheduler operates in two phases:
1. Filtering (Predicates): Eliminates nodes that cannot run the pod, for example nodes with insufficient allocatable CPU or memory for the pod's requests, nodes that fail nodeSelector or required affinity rules, or nodes whose taints the pod does not tolerate.
2. Scoring (Priorities): Each scoring plugin ranks the remaining feasible nodes from 0-100; the weighted scores are summed and the highest-scoring node wins. Typical plugins favor nodes with the most free resources, balanced CPU/memory utilization, or matching soft affinity preferences.
┌─────────────────────────────────────────────────────────────────────────────────┐
│ SCHEDULER DECISION FLOW │
│ │
│ Unscheduled Pod │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ FILTERING PHASE │ │
│ │ │ │
│ │ All Nodes: [node-1, node-2, node-3, node-4, node-5, node-6] │ │
│ │ │ │ │
│ │ PodFitsResources ──────►│ [node-1, node-2, node-3, node-5, node-6] │ │
│ │ (node-4 lacks memory) │ │ │
│ │ │ │ │
│ │ NodeSelector ──────────►│ [node-1, node-2, node-5] │ │
│ │ (requires zone=us-west) │ │ │
│ │ │ │ │
│ │ Taints/Tolerations ────►│ [node-1, node-5] │ │
│ │ (node-2 has NoSchedule) │ │ │
│ └───────────────────────────┴───────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ SCORING PHASE │ │
│ │ │ │
│ │ Feasible Nodes: [node-1, node-5] │ │
│ │ │ │
│ │ LeastRequested: node-1=60, node-5=80 │ │
│ │ BalancedAlloc: node-1=70, node-5=50 │ │
│ │ NodeAffinity: node-1=100, node-5=100 │ │
│ │ ────────────────────────────────────── │ │
│ │ Total: node-1=230, node-5=230 │ │
│ │ │ │
│ │ Tiebreaker: Random selection → node-5 │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Pod scheduled to node-5 │
└─────────────────────────────────────────────────────────────────────────────────┘
Resource Requests and Limits:
Pods specify resource requirements that inform scheduling and enforcement:
Requests: The minimum resources the pod is guaranteed. The scheduler uses requests to decide whether a pod fits on a node.
Limits: The maximum resources the pod may consume. CPU usage beyond the limit is throttled; memory usage beyond the limit triggers an OOM kill.
resources:
  requests:
    cpu: "500m"        # 0.5 CPU cores guaranteed
    memory: "256Mi"    # 256 MiB guaranteed
  limits:
    cpu: "1000m"       # Can burst to 1.0 CPU cores
    memory: "512Mi"    # Hard limit; OOM killed if exceeded
Quality of Service (QoS) Classes:
| QoS Class | Condition | Eviction Order |
|---|---|---|
| Guaranteed | Every container has requests equal to limits for both CPU and memory | Evicted last |
| Burstable | At least one container sets a request or limit, but the pod is not Guaranteed | Evicted second |
| BestEffort | No requests or limits set | Evicted first |
When memory pressure occurs, BestEffort pods are evicted first, then Burstable, then Guaranteed.
The sum of limits across pods can exceed node capacity (overcommit); the sum of requests cannot, because the scheduler only places a pod where its requests fit. If many pods use their full limits simultaneously, the node becomes overloaded. Set limits based on observed usage patterns and avoid extreme overcommit ratios.
Kubernetes provides sophisticated mechanisms for controlling pod placement beyond basic resource matching.
Node Affinity:
Defines where pods prefer or require to be scheduled based on node labels:
┌─────────────────────────────────────────────────────────────────┐
│ NODE AFFINITY TYPES │
│ │
│ requiredDuringSchedulingIgnoredDuringExecution (HARD) │
│ ──────────────────────────────────────────────── │
│ Pod MUST be placed on matching node or remains unscheduled │
│ │
│ preferredDuringSchedulingIgnoredDuringExecution (SOFT) │
│ ──────────────────────────────────────────────── │
│ Scheduler prefers matching nodes but will use others if needed │
│ Weight (1-100) determines preference strength │
│ │
│ Example Use Cases: │
│ - GPU workloads require nodes with hardware.nvidia.com/gpu │
│ - Prefer high-memory nodes for in-memory databases │
│ - Require specific availability zone for data locality │
└─────────────────────────────────────────────────────────────────┘
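A sketch combining both affinity types as a pod-spec fragment; the zone value and the instance-type label/value are illustrative:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:       # hard requirement
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a"]
    preferredDuringSchedulingIgnoredDuringExecution:      # soft preference
    - weight: 80                                          # 1-100, higher = stronger preference
      preference:
        matchExpressions:
        - key: node.kubernetes.io/instance-type           # illustrative label
          operator: In
          values: ["r5.2xlarge"]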
Pod Affinity and Anti-Affinity:
Defines where pods should be placed relative to other pods:
Pod Affinity: Attracts pods toward nodes (or zones) that already run certain pods, for example co-locating a cache with the web tier that uses it.
Pod Anti-Affinity: Repels pods away from nodes (or zones) that run certain pods, for example spreading replicas of the same application across failure domains.
┌─────────────────────────────────────────────────────────────────────────────────┐
│ POD ANTI-AFFINITY EXAMPLE │
│ │
│ Requirement: Spread database replicas across zones for HA │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Zone: us-1a │ │ Zone: us-1b │ │ Zone: us-1c │ │
│ │ │ │ │ │ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │ │
│ │ │ DB Pod │ │ │ │ DB Pod │ │ │ │ DB Pod │ │ │
│ │ │ (Primary)│ │ │ │(Replica)│ │ │ │(Replica)│ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │ │
│ │ │ │ │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ Anti-affinity rule: topologyKey=topology.kubernetes.io/zone │
│ Effect: No two DB pods on same zone │
└─────────────────────────────────────────────────────────────────────────────────┘
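A sketch of the anti-affinity rule behind this diagram, assuming the database pods carry the illustrative label app=database:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: database                           # illustrative label on the DB pods
      topologyKey: topology.kubernetes.io/zone    # "no two matching pods in the same zone"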
Taints and Tolerations:
Taints mark nodes as "off-limits" unless pods explicitly tolerate them:
Taint Effects: NoSchedule (new pods without a matching toleration are not scheduled onto the node), PreferNoSchedule (the scheduler avoids the node when possible), and NoExecute (existing pods without a toleration are also evicted).
Common Use Cases: Reserving dedicated or specialized nodes (GPU, high-memory), keeping general workloads off control plane nodes, and the taints Kubernetes adds automatically for node problems (for example node.kubernetes.io/not-ready with NoExecute).
# Taint a node
kubectl taint nodes gpu-node nvidia.com/gpu=true:NoSchedule

# Pod must have a matching toleration
tolerations:
- key: "nvidia.com/gpu"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
Use node affinity to select node pools (GPU, high-memory), pod anti-affinity to spread replicas across failure domains, and taints to reserve specialized nodes. Together, these provide fine-grained control over pod placement for both performance and availability.
Kubernetes networking implements a flat network model where every pod can communicate with every other pod without NAT. This simplifies application networking but requires sophisticated network implementations.
The Kubernetes Network Model:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES NETWORK MODEL │
│ │
│ Requirements: │
│ 1. All pods can communicate without NAT │
│ 2. All nodes can communicate with all pods without NAT │
│ 3. The IP a pod sees for itself is the IP others see for it │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ POD NETWORK (10.244.0.0/16) │ │
│ │ │ │
│ │ Node 1 (10.244.1.0/24) Node 2 (10.244.2.0/24) │ │
│ │ ┌───────────────────┐ ┌───────────────────┐ │ │
│ │ │ Pod A: 10.244.1.5 │────►│ Pod C: 10.244.2.3 │ │ │
│ │ │ Pod B: 10.244.1.8 │ │ Pod D: 10.244.2.7 │ │ │
│ │ └───────────────────┘ └───────────────────┘ │ │
│ │ │ ▲ │ │
│ │ └─────────────────────────┘ │ │
│ │ (Direct pod-to-pod, no NAT) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ SERVICE NETWORK (10.96.0.0/12) │ │
│ │ │ │
│ │ ClusterIP: 10.96.45.23 (my-service) │ │
│ │ │ │ │
│ │ ▼ (kube-proxy rules) │ │
│ │ Endpoints: [10.244.1.5:8080, 10.244.2.3:8080] │ │
│ │ │ │ │
│ │ ▼ (load balanced) │ │
│ │ Selected Pod │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
| Plugin | Network Type | Key Features | Use Case |
|---|---|---|---|
| Calico | Layer 3 (BGP) | Network policy, high performance | Production clusters, policy-heavy |
| Cilium | eBPF-based | Deep visibility, Kubernetes-native security | Security-focused, observability |
| Flannel | Overlay (VXLAN) | Simple, lightweight | Development, simple deployments |
| Weave Net | Overlay + encryption | Encryption, mesh networking | Multi-cloud, security requirements |
| AWS VPC CNI | Native VPC | Full VPC integration, ENI per pod | AWS EKS deployments |
| Azure CNI | Native VNet | Azure VNet integration | Azure AKS deployments |
Network Policies:
Network Policies are firewall rules for pod traffic, implementing microsegmentation:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-policy
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: backend
    ports:
    - protocol: TCP
      port: 5432
Effect: Only pods with label app=backend can connect to database pods on port 5432. All other ingress traffic is denied.
Important: Network Policies require a CNI that supports them (Calico, Cilium, Weave). Flannel does NOT enforce Network Policies.
By default, all pod-to-pod communication is allowed. Once you apply any NetworkPolicy selecting a pod, that pod enters an isolated mode where only explicitly allowed traffic is permitted. Start with deny-all policies and explicitly allow required communication paths.
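A common starting point is a namespace-wide default-deny policy (the namespace name here is illustrative); additional policies then open only the required paths:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production      # illustrative namespace
spec:
  podSelector: {}            # empty selector = every pod in the namespace
  policyTypes:
  - Ingress
  - Egress                   # no ingress/egress rules listed = all traffic denied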
Kubernetes provides storage abstractions that decouple pods from underlying storage implementations, much like how an operating system provides a filesystem abstraction over block devices.
Storage Abstraction Hierarchy:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES STORAGE ABSTRACTION │
│ │
│ Developer/User Level │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ PERSISTENT VOLUME CLAIM (PVC) │ │
│ │ "I need 10Gi of fast storage with ReadWriteOnce access" │ │
│ │ - Storage request from application perspective │ │
│ │ - Abstracts underlying provider │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ Binding │
│ ▼ │
│ Cluster Admin Level │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ PERSISTENT VOLUME (PV) │ │
│ │ "10Gi volume on AWS EBS gp3, ReadWriteOnce, Delete reclaim" │ │
│ │ - Actual storage resource provisioned in cluster │ │
│ │ - Can be pre-provisioned or dynamically created │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ Provisioning │
│ ▼ │
│ Infrastructure Level │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ STORAGE CLASS │ │
│ │ "gp3-fast: use aws-ebs provisioner, type=gp3, iops=3000" │ │
│ │ - Defines HOW to provision storage │ │
│ │ - References Container Storage Interface (CSI) driver │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ CSI Driver │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ UNDERLYING STORAGE PLATFORM │ │
│ │ AWS EBS | GCP PD | Azure Disk | NFS | Ceph | ... │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
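A sketch of the "gp3-fast" example from the diagram, assuming the AWS EBS CSI driver (ebs.csi.aws.com); the parameter values and claim name are illustrative:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-fast
provisioner: ebs.csi.aws.com           # CSI driver that knows HOW to provision
parameters:
  type: gp3
  iops: "3000"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim                     # illustrative name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3-fast           # triggers dynamic provisioning of a matching PV
  resources:
    requests:
      storage: 10Gi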
Container Storage Interface (CSI):
CSI standardizes how storage systems integrate with Kubernetes, so vendors ship a single out-of-tree driver instead of in-tree volume plugins:
CSI Driver Components: A controller plugin (typically a Deployment) handles volume provisioning, deletion, and attach/detach, while a node plugin (a DaemonSet) handles staging, mounting, and unmounting volumes on each node.
Volume Types:
| Type | Lifetime | Use Case |
|---|---|---|
| EmptyDir | Pod lifetime | Scratch space, caching |
| HostPath | Node lifetime | Node-level storage (dangerous in production) |
| PersistentVolumeClaim | Beyond pod | Databases, stateful applications |
| ConfigMap/Secret | Cluster lifetime | Configuration injection |
| Projected | Pod lifetime | Combine multiple sources into one mount |
When using StatefulSets, each replica gets its own PVC via volumeClaimTemplates. Deleting a StatefulSet does NOT delete associated PVCs/PVs—data persists. When scaling down, PVCs remain for potential scale-up. Set appropriate reclaimPolicy (Delete or Retain) based on data importance.
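A StatefulSet sketch using volumeClaimTemplates, assuming the illustrative gp3-fast class above; each replica receives its own claim (data-db-0, data-db-1, ...):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db                            # illustrative name
spec:
  serviceName: db                     # headless Service providing stable per-pod DNS
  replicas: 3
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
      - name: postgres
        image: postgres:16            # illustrative image
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:               # one PVC per replica, retained after scale-down
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3-fast      # illustrative StorageClass
      resources:
        requests:
          storage: 10Gi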
Operating Kubernetes in production requires patterns for reliability, observability, and operational efficiency.
High Availability Patterns:
1. Control Plane HA:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ HIGHLY AVAILABLE CONTROL PLANE │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Control Plane 1 │ │ Control Plane 2 │ │ Control Plane 3 │ │
│ │ │ │ │ │ │ │
│ │ API Server │ │ API Server │ │ API Server │ │
│ │ Scheduler │ │ Scheduler │ │ Scheduler │ │
│ │ Controller Mgr │ │ Controller Mgr │ │ Controller Mgr │ │
│ │ │ │ │ │ │ │
│ │ etcd member │ │ etcd member │ │ etcd member │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └────────────────────┼────────────────────┘ │
│ │ │
│ ┌────────────┴────────────┐ │
│ │ LOAD BALANCER │ │
│ │ (API Server endpoint) │ │
│ └─────────────────────────┘ │
│ │
│ - 3+ control plane nodes (odd number for etcd quorum) │
│ - Leader election for Scheduler and Controller Manager │
│ - Load balancer in front of API Servers │
└─────────────────────────────────────────────────────────────────────────────────┘
2. Pod Disruption Budgets (PDB):
Ensure enough replicas remain during voluntary disruptions:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2          # At least 2 pods must remain
  # OR (set one or the other, not both):
  # maxUnavailable: 1      # At most 1 pod can be down
  selector:
    matchLabels:
      app: web
Effect: During node drain or upgrades, Kubernetes won't evict pods if it would violate the PDB.
3. Topology Spread Constraints:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web
Effect: Pods spread evenly across availability zones with max difference of 1 between zones.
Use GitOps tools (ArgoCD, Flux) to manage Kubernetes manifests: all cluster state is defined in Git repositories, changes go through pull requests with review, and the tool automatically syncs Git to the cluster. This provides audit trails, rollback capability, and a consistent deployment process.
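A hedged sketch of an ArgoCD Application resource (the repository URL, paths, and namespaces are placeholders, and exact fields can vary by ArgoCD version):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/manifests.git   # placeholder Git repository
    targetRevision: main
    path: apps/web                                    # placeholder path to manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: production                             # placeholder target namespace
  syncPolicy:
    automated:
      prune: true        # delete resources that were removed from Git
      selfHeal: true     # revert manual drift back to the Git-defined state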
Kubernetes serves as the operating system for cloud-native applications, providing resource management, scheduling, networking, and storage abstractions at cluster scale.
Looking Ahead:
We've covered container orchestration in depth. The final page explores Cloud OS Considerations—how operating systems adapt to cloud environments, including optimizations for virtualized workloads, container-optimized distributions, and cloud-native security models.
You now possess comprehensive knowledge of Kubernetes architecture, scheduling, networking, storage, and production deployment patterns. Next, we'll explore how operating systems themselves are evolving to meet cloud computing requirements.