A Kubernetes cluster without observability is like flying an airplane without instruments—you're moving fast, but you have no idea where you are, what's around you, or when problems are developing. In the dynamic, distributed world of containers, observability is not optional.
Kubernetes clusters are inherently complex: hundreds or thousands of pods, ephemeral by design, communicating across networks, scheduled and rescheduled by automated controllers. When something goes wrong—and it will—you need deep visibility to understand what happened, why it happened, and how to prevent it from happening again.
The Three Pillars of Observability: metrics (numeric measurements over time), logs (discrete event records), and traces (the path of a request across services).
This page covers all three pillars with production-grade patterns and tooling.
By the end of this page, you'll understand how to deploy Prometheus for metrics collection, implement centralized logging with EFK/Loki, set up distributed tracing with Jaeger, design effective alerting strategies, and build dashboards that provide actionable insights.
Metrics in Kubernetes flow through multiple subsystems, each serving different purposes. Understanding this architecture helps you choose the right tools and configure them correctly.
Core Metrics Pipeline (Resource Metrics):
This built-in pipeline provides basic CPU/memory metrics used by HPA and kubectl top.
cAdvisor (in kubelet) → Metrics Server → Kubernetes API → HPA/VPA/kubectl
Full Metrics Pipeline (Prometheus):
For comprehensive monitoring, Prometheus collects metrics from all cluster components.
Targets (kubelet, api-server, pods) → Prometheus → Grafana/Alertmanager
Metric Types:
| Type | Description | Example | Use Case |
|---|---|---|---|
| Counter | Monotonically increasing value | http_requests_total | Request counts, errors, bytes sent |
| Gauge | Value that can go up or down | node_memory_usage_bytes | Current state: memory, CPU, queue depth |
| Histogram | Samples in configurable buckets | request_duration_seconds | Latency distribution, response sizes |
| Summary | Similar to histogram, calculates quantiles | request_latency_summary | Latency percentiles (client-side calculated) |
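The histogram type is the least intuitive of the four. Here is a minimal pure-Python sketch (no client library; the bucket bounds are illustrative) of how observations land in cumulative `le` buckets:

```python
# Minimal sketch of Prometheus histogram semantics: each observation
# increments every cumulative bucket whose upper bound covers it,
# plus a total count (+Inf bucket) and a running sum.
class MiniHistogram:
    def __init__(self, buckets=(0.1, 0.25, 0.5, 1.0, 2.5)):
        self.bounds = sorted(buckets)
        self.bucket_counts = {b: 0 for b in self.bounds}
        self.inf_count = 0    # +Inf bucket == total observation count
        self.total_sum = 0.0  # used to derive averages

    def observe(self, value):
        for b in self.bounds:
            if value <= b:
                self.bucket_counts[b] += 1
        self.inf_count += 1
        self.total_sum += value

hist = MiniHistogram()
for latency in (0.05, 0.3, 0.7, 4.2):
    hist.observe(latency)

# Buckets are cumulative: le=0.5 counts both the 0.05s and 0.3s requests.
print(hist.bucket_counts[0.5])  # 2
print(hist.inf_count)           # 4
```

Because buckets are cumulative, `histogram_quantile()` can estimate any percentile server-side, which is why histograms are preferred over summaries for aggregation across pods.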
```yaml
# Application exposing Prometheus metrics
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        # Prometheus annotations for service discovery
        # (must be on the pod template, not the Deployment itself)
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"  # Points at the metrics port below
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: app
          image: myapp:v1.0
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090  # Separate metrics port is common
---
# ServiceMonitor for Prometheus Operator (recommended)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  labels:
    release: prometheus  # Match Prometheus selector
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
  namespaceSelector:
    matchNames:
      - production
```

The kube-prometheus-stack (formerly prometheus-operator) is the standard for Kubernetes monitoring. It deploys a complete observability stack: the Prometheus Operator, Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics.
Installation with Helm:
```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi
```
```yaml
# Helm values for production kube-prometheus-stack
prometheus:
  prometheusSpec:
    # Retention settings
    retention: 30d
    retentionSize: 80GB
    # Storage (use persistent storage!)
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3  # Fast SSD storage class
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    # Resource allocation
    resources:
      requests:
        memory: 4Gi
        cpu: 1000m
      limits:
        memory: 8Gi
        cpu: 4000m
    # High availability
    replicas: 2
    # Service discovery scope
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    # Scrape configuration
    scrapeInterval: 30s
    evaluationInterval: 30s

alertmanager:
  alertmanagerSpec:
    replicas: 3
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 10Gi
  # Alert routing configuration
  config:
    global:
      smtp_from: alerts@example.com
      slack_api_url: https://hooks.slack.com/services/xxx
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'slack-notifications'
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-critical'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#alerts'
            title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
      - name: 'pagerduty-critical'
        pagerduty_configs:
          - service_key: <YOUR_PD_KEY>

grafana:
  adminPassword: strongpassword
  persistence:
    enabled: true
    size: 10Gi
  # Pre-configured dashboards
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          folder: 'Kubernetes'
          type: file
          options:
            path: /var/lib/grafana/dashboards
```

Not all metrics are equally valuable. Here are the essential metrics for production Kubernetes clusters, organized by layer:
Cluster-Level Metrics:
```promql
# CLUSTER HEALTH METRICS

# Nodes not ready
kube_node_status_condition{condition="Ready",status="true"} == 0

# Node resource pressure
kube_node_status_condition{condition=~"MemoryPressure|DiskPressure|PIDPressure",status="true"}

# Pods not running (stuck Pending, Failed, Unknown)
sum by (namespace) (kube_pod_status_phase{phase!~"Running|Succeeded"})

# Container restarts (indicates crashes)
sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h])) > 3

# RESOURCE UTILIZATION METRICS

# CPU utilization by namespace
sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
  / sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})

# Memory utilization by namespace
sum by (namespace) (container_memory_working_set_bytes)
  / sum by (namespace) (kube_pod_container_resource_requests{resource="memory"})

# Node CPU utilization
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Node memory utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# APPLICATION METRICS (RED Method)

# Request Rate
sum(rate(http_requests_total[5m])) by (service)

# Error Rate (5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  / sum(rate(http_requests_total[5m])) by (service)

# Duration (latency percentiles)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# SATURATION METRICS

# CPU throttling (indicates limit is too low)
sum by (pod) (rate(container_cpu_cfs_throttled_periods_total[5m]))
  / sum by (pod) (rate(container_cpu_cfs_periods_total[5m])) > 0.25

# OOM kill events
sum by (namespace, pod) (increase(kube_pod_container_status_terminated_reason{reason="OOMKilled"}[1h]))

# Pending pods (indicates capacity issues)
sum(kube_pod_status_phase{phase="Pending"})
```

| Method | Category | Key Metrics | Alert Threshold (Example) |
|---|---|---|---|
| USE | Utilization | CPU %, Memory %, Disk % | > 80% for 10m |
| USE | Saturation | CPU throttle %, pending pods | > 0 throttle events |
| USE | Errors | Node conditions, container restarts | > 3 restarts/hour |
| RED | Rate | Request/s, event/s | Anomaly from baseline |
| RED | Errors | Error rate %, 5xx count | > 1% error rate |
| RED | Duration | p50, p95, p99 latency | p99 > 500ms |
USE (Utilization, Saturation, Errors) is for infrastructure: nodes, storage, network. RED (Rate, Errors, Duration) is for services: APIs, microservices, applications. Both are necessary for complete observability.
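To make the RED side concrete, here is a small sketch that evaluates a service against the example thresholds from the table (1% error rate, 500 ms p99; both numbers are illustrative, not recommendations):

```python
# Evaluate RED-style thresholds for a service. The defaults mirror the
# example table (error rate > 1%, p99 > 500 ms) and are illustrative.
def red_check(total_requests, error_requests, p99_latency_ms,
              max_error_rate=0.01, max_p99_ms=500.0):
    error_rate = error_requests / total_requests if total_requests else 0.0
    alerts = []
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.2%} exceeds {max_error_rate:.0%}")
    if p99_latency_ms > max_p99_ms:
        alerts.append(f"p99 {p99_latency_ms:.0f}ms exceeds {max_p99_ms:.0f}ms")
    return alerts

print(red_check(10_000, 50, 320))   # -> []  (0.5% errors, 320 ms p99: healthy)
print(red_check(10_000, 250, 620))  # -> two alerts: error rate and latency
```

In production this evaluation lives in Prometheus alerting rules, not application code; the sketch just shows the arithmetic the rules encode.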
Alerts are only useful if they're actionable. An alerting system that pages operators for every minor fluctuation creates alert fatigue, where critical alerts are ignored amidst noise.
Principles of Effective Alerting:

- Alert on symptoms (user-visible impact), not on every internal cause.
- Every page must be actionable and should link to a runbook.
- Use severity tiers: critical pages someone; warning opens a ticket.
- Use `for:` durations so transient blips don't fire alerts.
```yaml
# PrometheusRule resource for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-critical
      rules:
        # High error rate = page immediately
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              / sum(rate(http_requests_total[5m])) by (service) > 0.05
          for: 5m
          labels:
            severity: critical
            team: platform
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (>5% threshold)"
            runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

        # Pods crashing repeatedly
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Pod has restarted {{ $value }} times in the last hour"

        # Node disk filling up
        - alert: NodeDiskRunningFull
          expr: |
            predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"}[6h], 24*60*60) < 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} disk will be full within 24h"

        # Deployment has zero replicas
        - alert: DeploymentReplicasUnavailable
          expr: |
            kube_deployment_status_replicas_available == 0
              and kube_deployment_spec_replicas > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Deployment {{ $labels.deployment }} has no available replicas"

    - name: kubernetes-slo
      rules:
        # SLO-based alerting: burn-rate alerts
        - alert: SLOBudgetBurn
          expr: |
            (
              # Fast burn: using 2% of monthly budget in 1 hour
              sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
                / sum(rate(http_requests_total[1h])) by (service) > 0.02
            )
            or
            (
              # Slow burn: using 5% of monthly budget in 6 hours
              sum(rate(http_requests_total{status=~"5.."}[6h])) by (service)
                / sum(rate(http_requests_total[6h])) by (service) > 0.01
            )
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "SLO budget burn detected for {{ $labels.service }}"
```

More alerts ≠ better monitoring. Teams that get 100+ alerts/week stop responding to any of them. Target fewer than 5 pages/week per on-call rotation. If you're getting more, either your system is broken or your alerts are.
Kubernetes doesn't provide a built-in logging solution—it relies on applications writing to stdout/stderr, which the container runtime captures. You must deploy a log aggregation solution to collect, store, and analyze logs.
Logging Flow: application stdout/stderr → container runtime (log files on the node) → log agent (DaemonSet) → aggregation backend (Elasticsearch/Loki)
Common Logging Stacks:
| Stack | Storage Model | Query Language | Resource Usage | Best For |
|---|---|---|---|---|
| EFK | Full-text indexed | KQL (Kibana Query Language) | High (many GB of RAM for ES) | Complex queries, large teams |
| PLG (Loki) | Label-indexed, log chunks | LogQL (Prometheus-like) | Low | Cost-sensitive, Prometheus users |
| Cloud Native | Managed service | Varies by provider | None (managed) | Reduced operational burden |
```yaml
# Fluent Bit DaemonSet for log collection
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
        - operator: Exists  # Run on all nodes including masters
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.1
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: config
              mountPath: /fluent-bit/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: config
          configMap:
            name: fluent-bit-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush             5
        Log_Level         info
        Parsers_File      parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            cri
        Tag               kube.*
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On
        Refresh_Interval  10

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [OUTPUT]
        Name                   loki
        Match                  *
        Host                   loki.logging.svc
        Port                   3100
        Labels                 job=fluent-bit
        Auto_Kubernetes_Labels on
```

Unstructured logs (plain text) are difficult to query and analyze at scale. Structured logging (JSON format) transforms logs from opaque text blobs into queryable data.
Why JSON Logs: every field becomes queryable (filter by `user_id`, aggregate by `service`), the schema stays consistent across services, and log agents can enrich entries with Kubernetes metadata without parsing free text.
Structured Log Best Practices: always include a timestamp, level, service name, and request/trace IDs for correlation; emit one JSON object per line; keep field names consistent across services.
```jsonc
// BAD: Unstructured log (hard to query)
"User john@example.com logged in from 192.168.1.1 at 2024-01-15 10:30:00"

// GOOD: Structured log (easy to query)
{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "INFO",
  "message": "User logged in",
  "service": "auth-service",
  "version": "v1.2.3",

  // Request context
  "request_id": "req-abc123",
  "trace_id": "trace-xyz789",
  "span_id": "span-456",

  // Event-specific data
  "user_email": "john@example.com",
  "user_id": "user-123",
  "ip_address": "192.168.1.1",
  "auth_method": "password",
  "mfa_used": true,

  // Kubernetes context (added by log agent)
  "kubernetes": {
    "namespace": "production",
    "pod_name": "auth-service-7d8f9c-x4k2m",
    "container_name": "auth",
    "node_name": "node-pool-1-abc"
  }
}

// Example queries in Loki LogQL:

// All logs from auth-service with errors:
// {service="auth-service"} |= "error"

// Login failures for specific user:
// {service="auth-service"} | json | message="User logged in" | user_email="john@example.com"

// High latency requests:
// {service="api-gateway"} | json | duration_ms > 1000
```

Be extremely careful about what you log. Never log passwords, tokens, credit card numbers, or other sensitive data. Redact or hash PII fields. Logs often end up in long-term storage with weaker access controls than production databases.
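One way to enforce that warning in application code is to scrub sensitive fields before a structured log line is serialized; a minimal sketch (the field list is illustrative, not exhaustive):

```python
import json

# Scrub sensitive fields before emitting a structured log line.
# The field list is illustrative -- extend it for your own schema.
SENSITIVE_KEYS = {"password", "token", "authorization", "credit_card"}

def redact(event):
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)  # recurse into nested objects
        else:
            clean[key] = value
    return clean

line = json.dumps(redact({
    "level": "INFO",
    "message": "User logged in",
    "user_email": "john@example.com",
    "password": "hunter2",
}))
print(line)  # password field appears as "[REDACTED]"
```

A redaction pass like this belongs in a shared logging library so no individual service can forget it; hashing (rather than redacting) fields like `user_email` preserves correlation while reducing exposure.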
In microservices architectures, a single user request often traverses many services. When something goes wrong, you need to answer: which service was slow? Where did the error originate? What was the call sequence?
Distributed tracing provides this visibility by correlating related events across services using trace and span IDs.
Tracing Concepts: a trace represents one request's end-to-end journey through the system. It consists of spans, each recording a single operation (an RPC, a database query) with its start time, duration, and metadata. Spans share a trace ID and reference a parent span ID, forming a tree.
Tracing Standards: OpenTelemetry is the vendor-neutral standard for instrumentation and collection, with OTLP as its wire protocol; the W3C Trace Context specification defines how trace and span IDs propagate in HTTP headers; Jaeger and Zipkin are widely deployed tracing backends.
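In practice, correlation rides on request headers. The W3C Trace Context `traceparent` header packs a version, trace ID, parent span ID, and sampling flag into one hyphen-separated value; a minimal sketch of building and parsing it:

```python
import secrets

# Build and parse a W3C Trace Context `traceparent` header:
#   version(2 hex) - trace id(32 hex) - parent span id(16 hex) - flags(2 hex)
def make_traceparent(trace_id=None, span_id=None, sampled=True):
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 8 bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "span_id": span_id, "sampled": flags == "01"}

header = make_traceparent()
print(header)  # e.g. 00-<32 hex chars>-<16 hex chars>-01
ctx = parse_traceparent(header)
```

Each service reuses the incoming trace ID, generates a fresh span ID for its own work, and forwards the updated header downstream; OpenTelemetry SDKs do this automatically once instrumentation is installed.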
```yaml
# OpenTelemetry Collector deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.91.0
          args:
            - --config=/etc/otel/config.yaml
          ports:
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
            - containerPort: 14268  # Jaeger thrift
            - containerPort: 9411   # Zipkin
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      jaeger:
        protocols:
          thrift_http:
            endpoint: 0.0.0.0:14268
      zipkin:
        endpoint: 0.0.0.0:9411

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s  # required by the memory_limiter processor
        limit_mib: 512
        spike_limit_mib: 128
      tail_sampling:
        decision_wait: 30s
        policies:
          - name: error-sampling
            type: status_code
            status_code: {status_codes: [ERROR]}
          - name: slow-request-sampling
            type: latency
            latency: {threshold_ms: 1000}
          - name: probabilistic-sampling
            type: probabilistic
            probabilistic: {sampling_percentage: 10}

    exporters:
      # Jaeger accepts OTLP natively; the dedicated `jaeger` exporter
      # was removed from recent collector-contrib releases.
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889

    service:
      pipelines:
        traces:
          receivers: [otlp, jaeger, zipkin]
          processors: [memory_limiter, batch, tail_sampling]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
```

Tracing everything is expensive. Use head sampling (decide at the start of a trace) for simplicity, or tail sampling (decide at the end of a trace) to ensure you capture errors and slow requests. A common pattern: sample 100% of errors and slow requests, and 10% of everything else.
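The keep/drop decision a tail sampler makes can be sketched in a few lines; the thresholds mirror the example policies above (1000 ms latency cutoff, 10% probabilistic) and are illustrative:

```python
import random

# Tail-sampling decision mirroring the example collector policies:
# keep all error traces, all traces with a span slower than the
# threshold, and ~10% of the remainder.
# `spans` is a list of (status, duration_ms) tuples.
def keep_trace(spans, latency_threshold_ms=1000, sample_rate=0.10, rng=random):
    if any(status == "ERROR" for status, _ in spans):
        return True  # always keep traces containing an error
    if any(duration > latency_threshold_ms for _, duration in spans):
        return True  # always keep slow traces
    return rng.random() < sample_rate  # probabilistic tail for the rest

print(keep_trace([("OK", 50), ("ERROR", 20)]))  # True: contains an error
print(keep_trace([("OK", 1500)]))               # True: slow request
```

The collector can only make this decision after buffering all spans of a trace (`decision_wait`), which is why tail sampling costs memory that head sampling does not.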
Dashboards translate raw metrics into insight. But the dashboard itself isn't the goal—enabling fast, accurate decision-making is.
Dashboard Design Principles: design each dashboard for a specific audience and the questions it needs answered; put the most important signals (errors, SLO status) at the top; show thresholds directly on panels; and link out to runbooks and drill-down views rather than cramming everything onto one screen.
Essential Dashboard Types:
| Dashboard Type | Audience | Key Panels | Update Frequency |
|---|---|---|---|
| Executive Overview | Leadership, stakeholders | SLO status, error budget, availability | Hourly/Daily |
| Cluster Health | Platform team | Node status, resource utilization, pending pods | Real-time |
| Service Health | Service owners | Request rate, error rate, latency, saturation | Real-time |
| Deployment | DevOps, SREs | Rollout status, replica count, error spikes | Real-time |
| Cost | FinOps, management | Resource usage vs allocation, cost by team | Daily |
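The SLO-status panel on an executive dashboard reduces to error-budget arithmetic; a sketch, assuming a 99.9% target (the target and traffic numbers are illustrative):

```python
# Error-budget status for an SLO dashboard panel. With a 99.9% target,
# the budget is 0.1% of requests over the SLO window; the target value
# here is illustrative.
def error_budget_status(total_requests, failed_requests, slo_target=0.999):
    budget = (1.0 - slo_target) * total_requests  # allowed failures
    remaining = max(0.0, budget - failed_requests)
    return {
        "availability": 1.0 - failed_requests / total_requests,
        "budget_remaining_pct": 100.0 * remaining / budget,
    }

status = error_budget_status(1_000_000, 400)
print(status["availability"])                     # 0.9996
print(round(status["budget_remaining_pct"], 1))   # 60.0
```

Expressing health as "budget remaining" rather than raw availability gives stakeholders a single number that also tells engineering how much risk (deploys, experiments) they can still afford this window.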
```jsonc
// Example Grafana dashboard structure for service health
{
  "title": "Service Health Dashboard",
  "rows": [
    {
      "title": "Overview",
      "panels": [
        {
          "title": "Request Rate",
          "type": "stat",
          "targets": [{"expr": "sum(rate(http_requests_total[5m]))"}],
          "thresholds": {"steps": [
            {"value": 0, "color": "green"},
            {"value": 10000, "color": "yellow"}
          ]}
        },
        {
          "title": "Error Rate",
          "type": "gauge",
          "targets": [{"expr": "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) * 100"}],
          "thresholds": {"steps": [
            {"value": 0, "color": "green"},
            {"value": 1, "color": "yellow"},
            {"value": 5, "color": "red"}
          ]}
        },
        {
          "title": "P99 Latency",
          "type": "stat",
          "unit": "ms",
          "targets": [{"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) * 1000"}]
        }
      ]
    },
    {
      "title": "Trends",
      "panels": [
        {
          "title": "Request Rate Over Time",
          "type": "timeseries",
          "targets": [{"expr": "sum(rate(http_requests_total[5m])) by (service)"}],
          "annotations": {"deployments": true}
        },
        {
          "title": "Latency Distribution",
          "type": "heatmap",
          "targets": [{"expr": "sum(rate(http_request_duration_seconds_bucket[1m])) by (le)"}]
        }
      ]
    }
  ]
}
```

Comprehensive observability transforms Kubernetes from an opaque black box into a transparent, debuggable system. To consolidate the key principles: metrics tell you what is happening, logs tell you why, traces show you where, alerts should fire only on actionable symptoms, and dashboards exist to drive decisions.
What's Next:
With observability in place, you can see what's happening in your cluster. The final page covers Security Best Practices—essential hardening techniques to protect your Kubernetes workloads from threats, vulnerabilities, and misconfigurations.
You now have comprehensive knowledge of Kubernetes observability—from metrics pipelines and Prometheus through logging architectures, distributed tracing, alerting strategies, and dashboard design. Apply these patterns to gain deep visibility into your production clusters.