Running containers at scale presents challenges remarkably similar to those facing operating systems: scheduling workloads onto available resources, managing process lifecycle, handling failures, providing networking and storage abstractions, and ensuring security isolation. Container orchestration systems address these challenges for distributed containerized applications.
Kubernetes has emerged as the dominant container orchestration platform, serving as the "operating system" for cloud-native applications. Understanding Kubernetes is essential for anyone building or operating modern cloud systems.
By completing this page, you will understand Kubernetes architecture and its operating system analogies, the scheduling algorithms that place containers on nodes, networking and storage abstractions, and production deployment patterns for resilient applications.
Kubernetes follows a declarative, controller-based architecture. Users specify desired state, and controllers continuously work to achieve and maintain that state. This model mirrors how operating systems manage resources through kernel subsystems.
The Kubernetes Architecture:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES CLUSTER │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ CONTROL PLANE │ │
│ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ API SERVER (kube-apiserver) │ │ │
│ │ │ - RESTful API for all cluster operations │ │ │
│ │ │ - Authentication, Authorization, Admission Control │ │ │
│ │ │ - Validation and persistence to etcd │ │ │
│ │ └─────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌─────────────────────┼─────────────────────┐ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌───────────┐ ┌─────────────────┐ ┌──────────────────┐ │ │
│ │ │ etcd │ │ Scheduler │ │Controller Manager│ │ │
│ │ │ │ │(kube-scheduler) │ │ │ │ │
│ │ │ Key-Value │ │ │ │ - Node Controller│ │ │
│ │ │ Store │ │ - Pod Placement │ │ - Replication │ │ │
│ │ │ │ │ - Resource Fit │ │ - Service/Endpoints│ │ │
│ │ │ Cluster │ │ - Affinity/Anti │ │ - Namespace │ │ │
│ │ │ State │ │ │ │ │ │ │
│ │ └───────────┘ └─────────────────┘ └──────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────────────┴───────────────────────────────────┐ │
│ │ WORKER NODES │ │
│ │ ┌─────────────────────┐ ┌─────────────────────┐ ┌────────────────┐ │ │
│ │ │ NODE 1 │ │ NODE 2 │ │ NODE N │ │ │
│ │ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │ ┌────────────┐ │ │ │
│ │ │ │ kubelet │ │ │ │ kubelet │ │ │ │ kubelet │ │ │ │
│ │ │ └─────────────────┘ │ │ └─────────────────┘ │ │ └────────────┘ │ │ │
│ │ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │ ┌────────────┐ │ │ │
│ │ │ │ kube-proxy │ │ │ │ kube-proxy │ │ │ │ kube-proxy │ │ │ │
│ │ │ └─────────────────┘ │ │ └─────────────────┘ │ │ └────────────┘ │ │ │
│ │ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │ ┌────────────┐ │ │ │
│ │ │ │Container Runtime│ │ │ │Container Runtime│ │ │ │Container │ │ │ │
│ │ │ │(containerd/CRI-O)│ │ │ │(containerd/CRI-O)│ │ │ │Runtime │ │ │ │
│ │ │ └─────────────────┘ │ │ └─────────────────┘ │ │ └────────────┘ │ │ │
│ │ │ ┌──┐ ┌──┐ ┌──┐ │ │ ┌──┐ ┌──┐ ┌──┐ │ │ ┌──┐ ┌──┐ │ │ │
│ │ │ │P1│ │P2│ │P3│ │ │ │P4│ │P5│ │P6│ │ │ │P7│ │P8│ │ │ │
│ │ │ └──┘ └──┘ └──┘ │ │ └──┘ └──┘ └──┘ │ │ └──┘ └──┘ │ │ │
│ │ └─────────────────────┘ └─────────────────────┘ └────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
| Kubernetes Component | Operating System Analogy | Primary Responsibility |
|---|---|---|
| API Server | System call interface | All cluster interactions go through API Server |
| etcd | Registry / Configuration database | Persistent storage of cluster state |
| Scheduler | Process scheduler (CPU scheduler) | Assigns pods to nodes based on resources |
| Controller Manager | Kernel subsystems | Ensures desired state equals actual state |
| kubelet | init/systemd process manager | Manages pod lifecycle on each node |
| kube-proxy | Network stack / iptables | Implements service networking rules |
| Container Runtime | Process execution (exec syscall) | Actually runs containers (containerd, CRI-O) |
Unlike imperative systems where you issue commands ('start container X'), Kubernetes is declarative: you specify 'I want 3 replicas of container X running', and controllers work continuously to make reality match desire. This enables self-healing—if a container dies, controllers automatically create a replacement.
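For example, a minimal Deployment manifest sketch (the name and image are illustrative) declares the desired state; the Deployment controller then creates or replaces pods until three replicas are running:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # illustrative name
spec:
  replicas: 3                    # desired state: three running copies
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25        # illustrative image
        ports:
        - containerPort: 80

If a pod backing this Deployment is deleted or its node fails, the controller notices the gap between desired and actual state and creates a replacement automatically.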
Kubernetes represents all resources as API objects stored in etcd. Understanding these objects is essential for working with Kubernetes.
Pods: The Atomic Unit
A Pod is one or more containers that share a network namespace (a single IP address, so containers reach each other over localhost), shared volumes, and a common lifecycle (they are scheduled, started, and stopped together):
┌─────────────────────────────────────────────────────────────┐
│ POD │
│ (IP: 10.244.1.5) │
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Main Container │ │ Sidecar Container │ │
│ │ │ │ │ │
│ │ (Application) │ │ (Logging Agent) │ │
│ │ │ │ │ │
│ │ Port 8080 ─────────┼──┼─► Port 9090 │ │
│ │ │ │ │ │
│ │ localhost:9090 │◄─┼─── │ │
│ │ (can reach sidecar)│ │ │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │ │
│ ┌─────────────────┴─────────────────┐ │
│ │ Shared Volume │ │
│ │ (EmptyDir, PVC, ConfigMap) │ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Why Pods, Not Just Containers? Many workloads consist of a main application plus tightly coupled helpers such as logging agents, proxies, or data loaders. A pod lets these run as separate container images while being scheduled as a single unit, sharing localhost networking and volumes.
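A minimal sketch of this sidecar pattern, mirroring the diagram above (image names are illustrative placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar             # illustrative name
spec:
  containers:
  - name: app                        # main application container
    image: example/app:1.0           # illustrative image
    ports:
    - containerPort: 8080
    volumeMounts:
    - name: shared-logs
      mountPath: /var/log/app
  - name: log-agent                  # sidecar: ships logs the app writes to the shared volume
    image: example/log-agent:1.0     # illustrative image
    ports:
    - containerPort: 9090
    volumeMounts:
    - name: shared-logs
      mountPath: /var/log/app
      readOnly: true
  volumes:
  - name: shared-logs
    emptyDir: {}                     # scratch volume shared by both containers, lives as long as the pod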
Workload Controllers:
Controllers manage pod lifecycle and ensure desired state:
Deployment: Manages stateless, replicated applications. It creates ReplicaSets, performs rolling updates and rollbacks, and keeps the specified number of identical pods running.
StatefulSet: Manages stateful applications that need stable network identities (pod-0, pod-1, ...), ordered startup and shutdown, and per-replica persistent storage.
DaemonSet: Runs one copy of a pod on every node (or every node matching a selector); commonly used for log collectors, monitoring agents, and networking components.
Job and CronJob: A Job runs pods to completion with retries on failure (batch work); a CronJob creates Jobs on a schedule.
Services: A Service provides a stable virtual IP (ClusterIP) and DNS name in front of a changing set of pods selected by labels. Pods access services via DNS: service-name.namespace.svc.cluster.local. This decouples consumers from provider pod IPs. When pods die and are recreated (potentially on different nodes with different IPs), the service DNS name continues to work, pointing to the new pod IPs via endpoints.
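A minimal ClusterIP Service sketch fronting the pods from the Deployment example above (names are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: default
spec:
  selector:
    app: web                # endpoints = all ready pods carrying this label
  ports:
  - protocol: TCP
    port: 80                # port exposed on the ClusterIP
    targetPort: 80          # container port on the selected pods

Pods in the same namespace can reach it simply as my-service; from elsewhere in the cluster, as my-service.default.svc.cluster.local.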
The Kubernetes scheduler is responsible for assigning unscheduled pods to nodes. This is the cloud-scale analog of a process scheduler, with similar concerns: resource matching, fairness, and optimization.
Scheduling Algorithm:
The scheduler operates in two phases:
1. Filtering (Predicates): Eliminates nodes that cannot run the pod, for example nodes with insufficient allocatable CPU or memory for the pod's requests, nodes that fail nodeSelector or required affinity rules, or nodes whose taints the pod does not tolerate.
2. Scoring (Priorities): Each scoring plugin ranks the remaining feasible nodes from 0-100; the weighted scores are summed and the highest-scoring node wins. Typical plugins favor nodes with the most free resources, balanced CPU/memory utilization, or matching soft affinity preferences.
┌─────────────────────────────────────────────────────────────────────────────────┐
│ SCHEDULER DECISION FLOW │
│ │
│ Unscheduled Pod │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ FILTERING PHASE │ │
│ │ │ │
│ │ All Nodes: [node-1, node-2, node-3, node-4, node-5, node-6] │ │
│ │ │ │ │
│ │ PodFitsResources ──────►│ [node-1, node-2, node-3, node-5, node-6] │ │
│ │ (node-4 lacks memory) │ │ │
│ │ │ │ │
│ │ NodeSelector ──────────►│ [node-1, node-2, node-5] │ │
│ │ (requires zone=us-west) │ │ │
│ │ │ │ │
│ │ Taints/Tolerations ────►│ [node-1, node-5] │ │
│ │ (node-2 has NoSchedule) │ │ │
│ └───────────────────────────┴───────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ SCORING PHASE │ │
│ │ │ │
│ │ Feasible Nodes: [node-1, node-5] │ │
│ │ │ │
│ │ LeastRequested: node-1=60, node-5=80 │ │
│ │ BalancedAlloc: node-1=70, node-5=50 │ │
│ │ NodeAffinity: node-1=100, node-5=100 │ │
│ │ ────────────────────────────────────── │ │
│ │ Total: node-1=230, node-5=230 │ │
│ │ │ │
│ │ Tiebreaker: Random selection → node-5 │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Pod scheduled to node-5 │
└─────────────────────────────────────────────────────────────────────────────────┘
Resource Requests and Limits:
Pods specify resource requirements that inform scheduling and enforcement:
Requests: The minimum resources the pod is guaranteed. The scheduler uses requests to decide whether a pod fits on a node.
Limits: The maximum resources the pod may consume. CPU usage beyond the limit is throttled; memory usage beyond the limit triggers an OOM kill.
resources:
  requests:
    cpu: "500m"        # 0.5 CPU cores guaranteed
    memory: "256Mi"    # 256 MiB guaranteed
  limits:
    cpu: "1000m"       # Can burst to 1.0 CPU cores
    memory: "512Mi"    # Hard limit; OOM killed if exceeded
Quality of Service (QoS) Classes:
| QoS Class | Condition | Eviction Order |
|---|---|---|
| Guaranteed | Every container has requests equal to limits for both CPU and memory | Evicted last |
| Burstable | At least one container sets a request or limit, but the pod is not Guaranteed | Evicted second |
| BestEffort | No requests or limits set | Evicted first |
When memory pressure occurs, BestEffort pods are evicted first, then Burstable, then Guaranteed.
The sum of limits across pods can exceed node capacity (overcommit); the sum of requests cannot, because the scheduler only places a pod where its requests fit. If many pods use their full limits simultaneously, the node becomes overloaded. Set limits based on observed usage patterns and avoid extreme overcommit ratios.
Kubernetes provides sophisticated mechanisms for controlling pod placement beyond basic resource matching.
Node Affinity:
Defines where pods prefer or require to be scheduled based on node labels:
┌─────────────────────────────────────────────────────────────────┐
│ NODE AFFINITY TYPES │
│ │
│ requiredDuringSchedulingIgnoredDuringExecution (HARD) │
│ ──────────────────────────────────────────────── │
│ Pod MUST be placed on matching node or remains unscheduled │
│ │
│ preferredDuringSchedulingIgnoredDuringExecution (SOFT) │
│ ──────────────────────────────────────────────── │
│ Scheduler prefers matching nodes but will use others if needed │
│ Weight (1-100) determines preference strength │
│ │
│ Example Use Cases: │
│ - GPU workloads require nodes with hardware.nvidia.com/gpu │
│ - Prefer high-memory nodes for in-memory databases │
│ - Require specific availability zone for data locality │
└─────────────────────────────────────────────────────────────────┘
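A sketch combining both affinity types as a pod-spec fragment; the zone value and the instance-type label/value are illustrative:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:       # hard requirement
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a"]
    preferredDuringSchedulingIgnoredDuringExecution:      # soft preference
    - weight: 80                                          # 1-100, higher = stronger preference
      preference:
        matchExpressions:
        - key: node.kubernetes.io/instance-type           # illustrative label
          operator: In
          values: ["r5.2xlarge"]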
Pod Affinity and Anti-Affinity:
Defines where pods should be placed relative to other pods:
Pod Affinity: Attracts pods toward nodes (or zones) that already run certain pods, for example co-locating a cache with the web tier that uses it.
Pod Anti-Affinity: Repels pods away from nodes (or zones) that run certain pods, for example spreading replicas of the same application across failure domains.
┌─────────────────────────────────────────────────────────────────────────────────┐
│ POD ANTI-AFFINITY EXAMPLE │
│ │
│ Requirement: Spread database replicas across zones for HA │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Zone: us-1a │ │ Zone: us-1b │ │ Zone: us-1c │ │
│ │ │ │ │ │ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │ │
│ │ │ DB Pod │ │ │ │ DB Pod │ │ │ │ DB Pod │ │ │
│ │ │ (Primary)│ │ │ │(Replica)│ │ │ │(Replica)│ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │ │
│ │ │ │ │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ Anti-affinity rule: topologyKey=topology.kubernetes.io/zone │
│ Effect: No two DB pods on same zone │
└─────────────────────────────────────────────────────────────────────────────────┘
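A sketch of the anti-affinity rule behind this diagram, assuming the database pods carry the illustrative label app=database:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: database                           # illustrative label on the DB pods
      topologyKey: topology.kubernetes.io/zone    # "no two matching pods in the same zone"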
Taints and Tolerations:
Taints mark nodes as "off-limits" unless pods explicitly tolerate them:
Taint Effects: NoSchedule (new pods without a matching toleration are not scheduled onto the node), PreferNoSchedule (the scheduler avoids the node when possible), and NoExecute (existing pods without a toleration are also evicted).
Common Use Cases: Reserving dedicated or specialized nodes (GPU, high-memory), keeping general workloads off control plane nodes, and the taints Kubernetes adds automatically for node problems (for example node.kubernetes.io/not-ready with NoExecute).
# Taint a node
kubectl taint nodes gpu-node nvidia.com/gpu=true:NoSchedule

# Pod must have a matching toleration
tolerations:
- key: "nvidia.com/gpu"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
Use node affinity to select node pools (GPU, high-memory), pod anti-affinity to spread replicas across failure domains, and taints to reserve specialized nodes. Together, these provide fine-grained control over pod placement for both performance and availability.
Kubernetes networking implements a flat network model where every pod can communicate with every other pod without NAT. This simplifies application networking but requires sophisticated network implementations.
The Kubernetes Network Model:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES NETWORK MODEL │
│ │
│ Requirements: │
│ 1. All pods can communicate without NAT │
│ 2. All nodes can communicate with all pods without NAT │
│ 3. The IP a pod sees for itself is the IP others see for it │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ POD NETWORK (10.244.0.0/16) │ │
│ │ │ │
│ │ Node 1 (10.244.1.0/24) Node 2 (10.244.2.0/24) │ │
│ │ ┌───────────────────┐ ┌───────────────────┐ │ │
│ │ │ Pod A: 10.244.1.5 │────►│ Pod C: 10.244.2.3 │ │ │
│ │ │ Pod B: 10.244.1.8 │ │ Pod D: 10.244.2.7 │ │ │
│ │ └───────────────────┘ └───────────────────┘ │ │
│ │ │ ▲ │ │
│ │ └─────────────────────────┘ │ │
│ │ (Direct pod-to-pod, no NAT) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ SERVICE NETWORK (10.96.0.0/12) │ │
│ │ │ │
│ │ ClusterIP: 10.96.45.23 (my-service) │ │
│ │ │ │ │
│ │ ▼ (kube-proxy rules) │ │
│ │ Endpoints: [10.244.1.5:8080, 10.244.2.3:8080] │ │
│ │ │ │ │
│ │ ▼ (load balanced) │ │
│ │ Selected Pod │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
| Plugin | Network Type | Key Features | Use Case |
|---|---|---|---|
| Calico | Layer 3 (BGP) | Network policy, high performance | Production clusters, policy-heavy |
| Cilium | eBPF-based | Deep visibility, Kubernetes-native security | Security-focused, observability |
| Flannel | Overlay (VXLAN) | Simple, lightweight | Development, simple deployments |
| Weave Net | Overlay + encryption | Encryption, mesh networking | Multi-cloud, security requirements |
| AWS VPC CNI | Native VPC | Full VPC integration, ENI per pod | AWS EKS deployments |
| Azure CNI | Native VNet | Azure VNet integration | Azure AKS deployments |
Network Policies:
Network Policies are firewall rules for pod traffic, implementing microsegmentation:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-policy
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: backend
    ports:
    - protocol: TCP
      port: 5432
Effect: Only pods with label app=backend can connect to database pods on port 5432. All other ingress traffic is denied.
Important: Network Policies require a CNI that supports them (Calico, Cilium, Weave). Flannel does NOT enforce Network Policies.
By default, all pod-to-pod communication is allowed. Once you apply any NetworkPolicy selecting a pod, that pod enters an isolated mode where only explicitly allowed traffic is permitted. Start with deny-all policies and explicitly allow required communication paths.
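A common starting point is a namespace-wide default-deny policy (the namespace name here is illustrative); additional policies then open only the required paths:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production      # illustrative namespace
spec:
  podSelector: {}            # empty selector = every pod in the namespace
  policyTypes:
  - Ingress
  - Egress                   # no ingress/egress rules listed = all traffic denied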
Kubernetes provides storage abstractions that decouple pods from underlying storage implementations, much like how an operating system provides a filesystem abstraction over block devices.
Storage Abstraction Hierarchy:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES STORAGE ABSTRACTION │
│ │
│ Developer/User Level │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ PERSISTENT VOLUME CLAIM (PVC) │ │
│ │ "I need 10Gi of fast storage with ReadWriteOnce access" │ │
│ │ - Storage request from application perspective │ │
│ │ - Abstracts underlying provider │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ Binding │
│ ▼ │
│ Cluster Admin Level │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ PERSISTENT VOLUME (PV) │ │
│ │ "10Gi volume on AWS EBS gp3, ReadWriteOnce, Delete reclaim" │ │
│ │ - Actual storage resource provisioned in cluster │ │
│ │ - Can be pre-provisioned or dynamically created │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ Provisioning │
│ ▼ │
│ Infrastructure Level │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ STORAGE CLASS │ │
│ │ "gp3-fast: use aws-ebs provisioner, type=gp3, iops=3000" │ │
│ │ - Defines HOW to provision storage │ │
│ │ - References Container Storage Interface (CSI) driver │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ CSI Driver │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ UNDERLYING STORAGE PLATFORM │ │
│ │ AWS EBS | GCP PD | Azure Disk | NFS | Ceph | ... │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
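A sketch of the "gp3-fast" example from the diagram, assuming the AWS EBS CSI driver (ebs.csi.aws.com); the parameter values and claim name are illustrative:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-fast
provisioner: ebs.csi.aws.com           # CSI driver that knows HOW to provision
parameters:
  type: gp3
  iops: "3000"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim                     # illustrative name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3-fast           # triggers dynamic provisioning of a matching PV
  resources:
    requests:
      storage: 10Gi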
Container Storage Interface (CSI):
CSI standardizes how storage systems integrate with Kubernetes, so vendors ship a single out-of-tree driver instead of in-tree volume plugins:
CSI Driver Components: A controller plugin (typically a Deployment) handles volume provisioning, deletion, and attach/detach, while a node plugin (a DaemonSet) handles staging, mounting, and unmounting volumes on each node.
Volume Types:
| Type | Lifetime | Use Case |
|---|---|---|
| EmptyDir | Pod lifetime | Scratch space, caching |
| HostPath | Node lifetime | Node-level storage (dangerous in production) |
| PersistentVolumeClaim | Beyond pod | Databases, stateful applications |
| ConfigMap/Secret | Cluster lifetime | Configuration injection |
| Projected | Pod lifetime | Combine multiple sources into one mount |
When using StatefulSets, each replica gets its own PVC via volumeClaimTemplates. Deleting a StatefulSet does NOT delete associated PVCs/PVs—data persists. When scaling down, PVCs remain for potential scale-up. Set appropriate reclaimPolicy (Delete or Retain) based on data importance.
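A StatefulSet sketch using volumeClaimTemplates, assuming the illustrative gp3-fast class above; each replica receives its own claim (data-db-0, data-db-1, ...):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db                            # illustrative name
spec:
  serviceName: db                     # headless Service providing stable per-pod DNS
  replicas: 3
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
      - name: postgres
        image: postgres:16            # illustrative image
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:               # one PVC per replica, retained after scale-down
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3-fast      # illustrative StorageClass
      resources:
        requests:
          storage: 10Gi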
Operating Kubernetes in production requires patterns for reliability, observability, and operational efficiency.
High Availability Patterns:
1. Control Plane HA:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ HIGHLY AVAILABLE CONTROL PLANE │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Control Plane 1 │ │ Control Plane 2 │ │ Control Plane 3 │ │
│ │ │ │ │ │ │ │
│ │ API Server │ │ API Server │ │ API Server │ │
│ │ Scheduler │ │ Scheduler │ │ Scheduler │ │
│ │ Controller Mgr │ │ Controller Mgr │ │ Controller Mgr │ │
│ │ │ │ │ │ │ │
│ │ etcd member │ │ etcd member │ │ etcd member │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └────────────────────┼────────────────────┘ │
│ │ │
│ ┌────────────┴────────────┐ │
│ │ LOAD BALANCER │ │
│ │ (API Server endpoint) │ │
│ └─────────────────────────┘ │
│ │
│ - 3+ control plane nodes (odd number for etcd quorum) │
│ - Leader election for Scheduler and Controller Manager │
│ - Load balancer in front of API Servers │
└─────────────────────────────────────────────────────────────────────────────────┘
2. Pod Disruption Budgets (PDB):
Ensure enough replicas remain during voluntary disruptions:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2          # At least 2 pods must remain
  # OR (set one or the other, not both):
  # maxUnavailable: 1      # At most 1 pod can be down
  selector:
    matchLabels:
      app: web
Effect: During node drain or upgrades, Kubernetes won't evict pods if it would violate the PDB.
3. Topology Spread Constraints:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web
Effect: Pods spread evenly across availability zones with max difference of 1 between zones.
Use GitOps tools (ArgoCD, Flux) to manage Kubernetes manifests: all cluster state is defined in Git repositories, changes go through pull requests with review, and the tool automatically syncs Git to the cluster. This provides audit trails, rollback capability, and a consistent deployment process.
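A hedged sketch of an ArgoCD Application resource (the repository URL, paths, and namespaces are placeholders, and exact fields can vary by ArgoCD version):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/manifests.git   # placeholder Git repository
    targetRevision: main
    path: apps/web                                    # placeholder path to manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: production                             # placeholder target namespace
  syncPolicy:
    automated:
      prune: true        # delete resources that were removed from Git
      selfHeal: true     # revert manual drift back to the Git-defined state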
Kubernetes serves as the operating system for cloud-native applications, providing resource management, scheduling, networking, and storage abstractions at cluster scale.
Looking Ahead:
We've covered container orchestration in depth. The final page explores Cloud OS Considerations—how operating systems adapt to cloud environments, including optimizations for virtualized workloads, container-optimized distributions, and cloud-native security models.
You now possess comprehensive knowledge of Kubernetes architecture, scheduling, networking, storage, and production deployment patterns. Next, we'll explore how operating systems themselves are evolving to meet cloud computing requirements.