A Kubernetes cluster without observability is like flying an airplane without instruments—you're moving fast, but you have no idea where you are, what's around you, or when problems are developing. In the dynamic, distributed world of containers, observability is not optional.
Kubernetes clusters are inherently complex: hundreds or thousands of pods, ephemeral by design, communicating across networks, scheduled and rescheduled by automated controllers. When something goes wrong—and it will—you need deep visibility to understand what happened, why it happened, and how to prevent it from happening again.
The Three Pillars of Observability: metrics (numeric measurements over time), logs (discrete event records), and traces (the path of a request across services).
This page covers all three pillars with production-grade patterns and tooling.
By the end of this page, you'll understand how to deploy Prometheus for metrics collection, implement centralized logging with EFK/Loki, set up distributed tracing with Jaeger, design effective alerting strategies, and build dashboards that provide actionable insights.
Metrics in Kubernetes flow through multiple subsystems, each serving different purposes. Understanding this architecture helps you choose the right tools and configure them correctly.
Core Metrics Pipeline (Resource Metrics):
This built-in pipeline provides basic CPU/memory metrics used by HPA and kubectl top.
cAdvisor (in kubelet) → Metrics Server → Kubernetes API → HPA/VPA/kubectl
Full Metrics Pipeline (Prometheus):
For comprehensive monitoring, Prometheus collects metrics from all cluster components.
Targets (kubelet, api-server, pods) → Prometheus → Grafana/Alertmanager
Metric Types:
| Type | Description | Example | Use Case |
|---|---|---|---|
| Counter | Monotonically increasing value | http_requests_total | Request counts, errors, bytes sent |
| Gauge | Value that can go up or down | node_memory_usage_bytes | Current state: memory, CPU, queue depth |
| Histogram | Samples in configurable buckets | request_duration_seconds | Latency distribution, response sizes |
| Summary | Similar to histogram, calculates quantiles | request_latency_summary | Latency percentiles (client-side calculated) |
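The histogram type is the least intuitive of the four. Here is a minimal pure-Python sketch (no client library; the bucket bounds are illustrative) of how observations land in cumulative `le` buckets:

```python
# Minimal sketch of Prometheus histogram semantics: each observation
# increments every cumulative bucket whose upper bound covers it,
# plus a total count (+Inf bucket) and a running sum.
class MiniHistogram:
    def __init__(self, buckets=(0.1, 0.25, 0.5, 1.0, 2.5)):
        self.bounds = sorted(buckets)
        self.bucket_counts = {b: 0 for b in self.bounds}
        self.inf_count = 0    # +Inf bucket == total observation count
        self.total_sum = 0.0  # used to derive averages

    def observe(self, value):
        for b in self.bounds:
            if value <= b:
                self.bucket_counts[b] += 1
        self.inf_count += 1
        self.total_sum += value

hist = MiniHistogram()
for latency in (0.05, 0.3, 0.7, 4.2):
    hist.observe(latency)

# Buckets are cumulative: le=0.5 counts both the 0.05s and 0.3s requests.
print(hist.bucket_counts[0.5])  # 2
print(hist.inf_count)           # 4
```

Because buckets are cumulative, `histogram_quantile()` can estimate any percentile server-side, which is why histograms are preferred over summaries for aggregation across pods.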
```yaml
# Application exposing Prometheus metrics
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        # Prometheus annotations for service discovery
        # (must be on the pod template, not the Deployment itself)
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"  # Points at the metrics port below
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: app
          image: myapp:v1.0
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090  # Separate metrics port is common
---
# ServiceMonitor for Prometheus Operator (recommended)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  labels:
    release: prometheus  # Match Prometheus selector
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
  namespaceSelector:
    matchNames:
      - production
```

The kube-prometheus-stack (formerly prometheus-operator) is the standard for Kubernetes monitoring. It deploys a complete observability stack: the Prometheus Operator, Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics.
Installation with Helm:
```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi
```
```yaml
# Helm values for production kube-prometheus-stack
prometheus:
  prometheusSpec:
    # Retention settings
    retention: 30d
    retentionSize: 80GB
    # Storage (use persistent storage!)
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3  # Fast SSD storage class
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    # Resource allocation
    resources:
      requests:
        memory: 4Gi
        cpu: 1000m
      limits:
        memory: 8Gi
        cpu: 4000m
    # High availability
    replicas: 2
    # Service discovery scope
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    # Scrape configuration
    scrapeInterval: 30s
    evaluationInterval: 30s

alertmanager:
  alertmanagerSpec:
    replicas: 3
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 10Gi
  # Alert routing configuration
  config:
    global:
      smtp_from: alerts@example.com
      slack_api_url: https://hooks.slack.com/services/xxx
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'slack-notifications'
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-critical'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#alerts'
            title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
      - name: 'pagerduty-critical'
        pagerduty_configs:
          - service_key: <YOUR_PD_KEY>

grafana:
  adminPassword: strongpassword
  persistence:
    enabled: true
    size: 10Gi
  # Pre-configured dashboards
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          folder: 'Kubernetes'
          type: file
          options:
            path: /var/lib/grafana/dashboards
```

Not all metrics are equally valuable. Here are the essential metrics for production Kubernetes clusters, organized by layer:
Cluster-Level Metrics:
```promql
# CLUSTER HEALTH METRICS

# Nodes not ready
kube_node_status_condition{condition="Ready",status="true"} == 0

# Node resource pressure
kube_node_status_condition{condition=~"MemoryPressure|DiskPressure|PIDPressure",status="true"}

# Pods not running (stuck Pending, Failed, Unknown)
sum by (namespace) (kube_pod_status_phase{phase!~"Running|Succeeded"})

# Container restarts (indicates crashes)
sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h])) > 3

# RESOURCE UTILIZATION METRICS

# CPU utilization by namespace
sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
  / sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})

# Memory utilization by namespace
sum by (namespace) (container_memory_working_set_bytes)
  / sum by (namespace) (kube_pod_container_resource_requests{resource="memory"})

# Node CPU utilization
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Node memory utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# APPLICATION METRICS (RED Method)

# Request Rate
sum(rate(http_requests_total[5m])) by (service)

# Error Rate (5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  / sum(rate(http_requests_total[5m])) by (service)

# Duration (latency percentiles)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# SATURATION METRICS

# CPU throttling (indicates limit is too low)
sum by (pod) (rate(container_cpu_cfs_throttled_periods_total[5m]))
  / sum by (pod) (rate(container_cpu_cfs_periods_total[5m])) > 0.25

# OOM kill events
sum by (namespace, pod) (increase(kube_pod_container_status_terminated_reason{reason="OOMKilled"}[1h]))

# Pending pods (indicates capacity issues)
sum(kube_pod_status_phase{phase="Pending"})
```

| Method | Category | Key Metrics | Alert Threshold (Example) |
|---|---|---|---|
| USE | Utilization | CPU %, Memory %, Disk % | > 80% for 10m |
| USE | Saturation | CPU throttle %, pending pods | > 0 throttle events |
| USE | Errors | Node conditions, container restarts | > 3 restarts/hour |
| RED | Rate | Request/s, event/s | Anomaly from baseline |
| RED | Errors | Error rate %, 5xx count | > 1% error rate |
| RED | Duration | p50, p95, p99 latency | p99 > 500ms |
USE (Utilization, Saturation, Errors) is for infrastructure: nodes, storage, network. RED (Rate, Errors, Duration) is for services: APIs, microservices, applications. Both are necessary for complete observability.
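To make the RED side concrete, here is a small sketch that evaluates a service against the example thresholds from the table (1% error rate, 500 ms p99; both numbers are illustrative, not recommendations):

```python
# Evaluate RED-style thresholds for a service. The defaults mirror the
# example table (error rate > 1%, p99 > 500 ms) and are illustrative.
def red_check(total_requests, error_requests, p99_latency_ms,
              max_error_rate=0.01, max_p99_ms=500.0):
    error_rate = error_requests / total_requests if total_requests else 0.0
    alerts = []
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.2%} exceeds {max_error_rate:.0%}")
    if p99_latency_ms > max_p99_ms:
        alerts.append(f"p99 {p99_latency_ms:.0f}ms exceeds {max_p99_ms:.0f}ms")
    return alerts

print(red_check(10_000, 50, 320))   # -> []  (0.5% errors, 320 ms p99: healthy)
print(red_check(10_000, 250, 620))  # -> two alerts: error rate and latency
```

In production this evaluation lives in Prometheus alerting rules, not application code; the sketch just shows the arithmetic the rules encode.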
Alerts are only useful if they're actionable. An alerting system that pages operators for every minor fluctuation creates alert fatigue, where critical alerts are ignored amidst noise.
Principles of Effective Alerting:

- Alert on symptoms (user-visible impact), not on every internal cause.
- Every page must be actionable and should link to a runbook.
- Use severity tiers: critical pages someone; warning opens a ticket.
- Use `for:` durations so transient blips don't fire alerts.
```yaml
# PrometheusRule resource for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-critical
      rules:
        # High error rate = page immediately
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              / sum(rate(http_requests_total[5m])) by (service) > 0.05
          for: 5m
          labels:
            severity: critical
            team: platform
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }} (>5% threshold)"
            runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

        # Pods crashing repeatedly
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Pod has restarted {{ $value }} times in the last hour"

        # Node disk filling up
        - alert: NodeDiskRunningFull
          expr: |
            predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"}[6h], 24*60*60) < 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} disk will be full within 24h"

        # Deployment has zero replicas
        - alert: DeploymentReplicasUnavailable
          expr: |
            kube_deployment_status_replicas_available == 0
              and kube_deployment_spec_replicas > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Deployment {{ $labels.deployment }} has no available replicas"

    - name: kubernetes-slo
      rules:
        # SLO-based alerting: burn-rate alerts
        - alert: SLOBudgetBurn
          expr: |
            (
              # Fast burn: using 2% of monthly budget in 1 hour
              sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
                / sum(rate(http_requests_total[1h])) by (service) > 0.02
            )
            or
            (
              # Slow burn: using 5% of monthly budget in 6 hours
              sum(rate(http_requests_total{status=~"5.."}[6h])) by (service)
                / sum(rate(http_requests_total[6h])) by (service) > 0.01
            )
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "SLO budget burn detected for {{ $labels.service }}"
```

More alerts ≠ better monitoring. Teams that get 100+ alerts/week stop responding to any of them. Target fewer than 5 pages/week per on-call rotation. If you're getting more, either your system is broken or your alerts are.
Kubernetes doesn't provide a built-in logging solution—it relies on applications writing to stdout/stderr, which the container runtime captures. You must deploy a log aggregation solution to collect, store, and analyze logs.
Logging Flow: application stdout/stderr → container runtime (log files on the node) → log agent (DaemonSet) → aggregation backend (Elasticsearch/Loki)
Common Logging Stacks:
| Stack | Storage Model | Query Language | Resource Usage | Best For |
|---|---|---|---|---|
| EFK | Full-text indexed | KQL (Kibana Query Language) | High (many GB of RAM for ES) | Complex queries, large teams |
| PLG (Loki) | Label-indexed, log chunks | LogQL (Prometheus-like) | Low | Cost-sensitive, Prometheus users |
| Cloud Native | Managed service | Varies by provider | None (managed) | Reduced operational burden |
```yaml
# Fluent Bit DaemonSet for log collection
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
        - operator: Exists  # Run on all nodes including masters
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.1
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: config
              mountPath: /fluent-bit/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: config
          configMap:
            name: fluent-bit-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush             5
        Log_Level         info
        Parsers_File      parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            cri
        Tag               kube.*
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On
        Refresh_Interval  10

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [OUTPUT]
        Name                   loki
        Match                  *
        Host                   loki.logging.svc
        Port                   3100
        Labels                 job=fluent-bit
        Auto_Kubernetes_Labels on
```

Unstructured logs (plain text) are difficult to query and analyze at scale. Structured logging (JSON format) transforms logs from opaque text blobs into queryable data.
Why JSON Logs: every field becomes queryable (filter by `user_id`, aggregate by `service`), the schema stays consistent across services, and log agents can enrich entries with Kubernetes metadata without parsing free text.
Structured Log Best Practices: always include a timestamp, level, service name, and request/trace IDs for correlation; emit one JSON object per line; keep field names consistent across services.
```jsonc
// BAD: Unstructured log (hard to query)
"User john@example.com logged in from 192.168.1.1 at 2024-01-15 10:30:00"

// GOOD: Structured log (easy to query)
{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "INFO",
  "message": "User logged in",
  "service": "auth-service",
  "version": "v1.2.3",

  // Request context
  "request_id": "req-abc123",
  "trace_id": "trace-xyz789",
  "span_id": "span-456",

  // Event-specific data
  "user_email": "john@example.com",
  "user_id": "user-123",
  "ip_address": "192.168.1.1",
  "auth_method": "password",
  "mfa_used": true,

  // Kubernetes context (added by log agent)
  "kubernetes": {
    "namespace": "production",
    "pod_name": "auth-service-7d8f9c-x4k2m",
    "container_name": "auth",
    "node_name": "node-pool-1-abc"
  }
}

// Example queries in Loki LogQL:

// All logs from auth-service with errors:
// {service="auth-service"} |= "error"

// Login failures for specific user:
// {service="auth-service"} | json | message="User logged in" | user_email="john@example.com"

// High latency requests:
// {service="api-gateway"} | json | duration_ms > 1000
```

Be extremely careful about what you log. Never log passwords, tokens, credit card numbers, or other sensitive data. Redact or hash PII fields. Logs often end up in long-term storage with weaker access controls than production databases.
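One way to enforce that warning in application code is to scrub sensitive fields before a structured log line is serialized; a minimal sketch (the field list is illustrative, not exhaustive):

```python
import json

# Scrub sensitive fields before emitting a structured log line.
# The field list is illustrative -- extend it for your own schema.
SENSITIVE_KEYS = {"password", "token", "authorization", "credit_card"}

def redact(event):
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)  # recurse into nested objects
        else:
            clean[key] = value
    return clean

line = json.dumps(redact({
    "level": "INFO",
    "message": "User logged in",
    "user_email": "john@example.com",
    "password": "hunter2",
}))
print(line)  # password field appears as "[REDACTED]"
```

A redaction pass like this belongs in a shared logging library so no individual service can forget it; hashing (rather than redacting) fields like `user_email` preserves correlation while reducing exposure.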
In microservices architectures, a single user request often traverses many services. When something goes wrong, you need to answer: which service was slow? Where did the error originate? What was the call sequence?
Distributed tracing provides this visibility by correlating related events across services using trace and span IDs.
Tracing Concepts: a trace represents one request's end-to-end journey through the system. It consists of spans, each recording a single operation (an RPC, a database query) with its start time, duration, and metadata. Spans share a trace ID and reference a parent span ID, forming a tree.
Tracing Standards: OpenTelemetry is the vendor-neutral standard for instrumentation and collection, with OTLP as its wire protocol; the W3C Trace Context specification defines how trace and span IDs propagate in HTTP headers; Jaeger and Zipkin are widely deployed tracing backends.
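In practice, correlation rides on request headers. The W3C Trace Context `traceparent` header packs a version, trace ID, parent span ID, and sampling flag into one hyphen-separated value; a minimal sketch of building and parsing it:

```python
import secrets

# Build and parse a W3C Trace Context `traceparent` header:
#   version(2 hex) - trace id(32 hex) - parent span id(16 hex) - flags(2 hex)
def make_traceparent(trace_id=None, span_id=None, sampled=True):
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 8 bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "span_id": span_id, "sampled": flags == "01"}

header = make_traceparent()
print(header)  # e.g. 00-<32 hex chars>-<16 hex chars>-01
ctx = parse_traceparent(header)
```

Each service reuses the incoming trace ID, generates a fresh span ID for its own work, and forwards the updated header downstream; OpenTelemetry SDKs do this automatically once instrumentation is installed.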
```yaml
# OpenTelemetry Collector deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.91.0
          args:
            - --config=/etc/otel/config.yaml
          ports:
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
            - containerPort: 14268  # Jaeger thrift
            - containerPort: 9411   # Zipkin
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      jaeger:
        protocols:
          thrift_http:
            endpoint: 0.0.0.0:14268
      zipkin:
        endpoint: 0.0.0.0:9411

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s  # required by the memory_limiter processor
        limit_mib: 512
        spike_limit_mib: 128
      tail_sampling:
        decision_wait: 30s
        policies:
          - name: error-sampling
            type: status_code
            status_code: {status_codes: [ERROR]}
          - name: slow-request-sampling
            type: latency
            latency: {threshold_ms: 1000}
          - name: probabilistic-sampling
            type: probabilistic
            probabilistic: {sampling_percentage: 10}

    exporters:
      # Jaeger accepts OTLP natively; the dedicated `jaeger` exporter
      # was removed from recent collector-contrib releases.
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889

    service:
      pipelines:
        traces:
          receivers: [otlp, jaeger, zipkin]
          processors: [memory_limiter, batch, tail_sampling]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
```

Tracing everything is expensive. Use head sampling (decide at the start of a trace) for simplicity, or tail sampling (decide at the end of a trace) to ensure you capture errors and slow requests. A common pattern: sample 100% of errors and slow requests, and 10% of everything else.
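The keep/drop decision a tail sampler makes can be sketched in a few lines; the thresholds mirror the example policies above (1000 ms latency cutoff, 10% probabilistic) and are illustrative:

```python
import random

# Tail-sampling decision mirroring the example collector policies:
# keep all error traces, all traces with a span slower than the
# threshold, and ~10% of the remainder.
# `spans` is a list of (status, duration_ms) tuples.
def keep_trace(spans, latency_threshold_ms=1000, sample_rate=0.10, rng=random):
    if any(status == "ERROR" for status, _ in spans):
        return True  # always keep traces containing an error
    if any(duration > latency_threshold_ms for _, duration in spans):
        return True  # always keep slow traces
    return rng.random() < sample_rate  # probabilistic tail for the rest

print(keep_trace([("OK", 50), ("ERROR", 20)]))  # True: contains an error
print(keep_trace([("OK", 1500)]))               # True: slow request
```

The collector can only make this decision after buffering all spans of a trace (`decision_wait`), which is why tail sampling costs memory that head sampling does not.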
Dashboards translate raw metrics into insight. But the dashboard itself isn't the goal—enabling fast, accurate decision-making is.
Dashboard Design Principles: design each dashboard for a specific audience and the questions it needs answered; put the most important signals (errors, SLO status) at the top; show thresholds directly on panels; and link out to runbooks and drill-down views rather than cramming everything onto one screen.
Essential Dashboard Types:
| Dashboard Type | Audience | Key Panels | Update Frequency |
|---|---|---|---|
| Executive Overview | Leadership, stakeholders | SLO status, error budget, availability | Hourly/Daily |
| Cluster Health | Platform team | Node status, resource utilization, pending pods | Real-time |
| Service Health | Service owners | Request rate, error rate, latency, saturation | Real-time |
| Deployment | DevOps, SREs | Rollout status, replica count, error spikes | Real-time |
| Cost | FinOps, management | Resource usage vs allocation, cost by team | Daily |
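The SLO-status panel on an executive dashboard reduces to error-budget arithmetic; a sketch, assuming a 99.9% target (the target and traffic numbers are illustrative):

```python
# Error-budget status for an SLO dashboard panel. With a 99.9% target,
# the budget is 0.1% of requests over the SLO window; the target value
# here is illustrative.
def error_budget_status(total_requests, failed_requests, slo_target=0.999):
    budget = (1.0 - slo_target) * total_requests  # allowed failures
    remaining = max(0.0, budget - failed_requests)
    return {
        "availability": 1.0 - failed_requests / total_requests,
        "budget_remaining_pct": 100.0 * remaining / budget,
    }

status = error_budget_status(1_000_000, 400)
print(status["availability"])                     # 0.9996
print(round(status["budget_remaining_pct"], 1))   # 60.0
```

Expressing health as "budget remaining" rather than raw availability gives stakeholders a single number that also tells engineering how much risk (deploys, experiments) they can still afford this window.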
```jsonc
// Example Grafana dashboard structure for service health
{
  "title": "Service Health Dashboard",
  "rows": [
    {
      "title": "Overview",
      "panels": [
        {
          "title": "Request Rate",
          "type": "stat",
          "targets": [{"expr": "sum(rate(http_requests_total[5m]))"}],
          "thresholds": {"steps": [
            {"value": 0, "color": "green"},
            {"value": 10000, "color": "yellow"}
          ]}
        },
        {
          "title": "Error Rate",
          "type": "gauge",
          "targets": [{"expr": "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) * 100"}],
          "thresholds": {"steps": [
            {"value": 0, "color": "green"},
            {"value": 1, "color": "yellow"},
            {"value": 5, "color": "red"}
          ]}
        },
        {
          "title": "P99 Latency",
          "type": "stat",
          "unit": "ms",
          "targets": [{"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) * 1000"}]
        }
      ]
    },
    {
      "title": "Trends",
      "panels": [
        {
          "title": "Request Rate Over Time",
          "type": "timeseries",
          "targets": [{"expr": "sum(rate(http_requests_total[5m])) by (service)"}],
          "annotations": {"deployments": true}
        },
        {
          "title": "Latency Distribution",
          "type": "heatmap",
          "targets": [{"expr": "sum(rate(http_request_duration_seconds_bucket[1m])) by (le)"}]
        }
      ]
    }
  ]
}
```

Comprehensive observability transforms Kubernetes from an opaque black box into a transparent, debuggable system. To consolidate the key principles: metrics tell you what is happening, logs tell you why, traces show you where, alerts should fire only on actionable symptoms, and dashboards exist to drive decisions.
What's Next:
With observability in place, you can see what's happening in your cluster. The final page covers Security Best Practices—essential hardening techniques to protect your Kubernetes workloads from threats, vulnerabilities, and misconfigurations.
You now have comprehensive knowledge of Kubernetes observability—from metrics pipelines and Prometheus through logging architectures, distributed tracing, alerting strategies, and dashboard design. Apply these patterns to gain deep visibility into your production clusters.