Not all workloads are meant to run forever. Some are designed to start, execute, and terminate—processing a batch of data, running a database migration, generating a report, or training a machine learning model.
These workloads don't fit the Deployment model (which tries to keep pods running indefinitely) or the StatefulSet model (which maintains persistent identity). They need a different abstraction: one that understands the concept of completion.
This is the domain of Jobs and CronJobs. Jobs run tasks to completion once. CronJobs run Jobs on a schedule, like cron in Unix systems but with Kubernetes' orchestration capabilities.
By the end of this page, you'll understand how to design reliable batch processing with Jobs—including parallelism, completion guarantees, and failure handling. You'll master CronJob scheduling for periodic tasks, understand timezone handling, and learn patterns for building robust data pipelines and maintenance automation.
A Kubernetes Job creates one or more pods and ensures that a specified number of them successfully terminate. Unlike Deployments that maintain a desired number of running pods indefinitely, Jobs track completions—successful terminations.
The Job lifecycle:
1. The Job controller creates pods, up to parallelism at a time.
2. Each pod runs to termination; a successful exit counts toward completions.
3. Failed pods are replaced (restartPolicy: Never) or have their containers restarted in place (OnFailure).
4. When enough pods have succeeded, the Job is marked Complete; if retries or the deadline are exhausted first, it is marked Failed.

Key Job characteristics:
- Success means pods terminating with exit code 0, not pods staying Ready.
- restartPolicy must be Never or OnFailure; Always is not allowed.
- Failed pods are retried up to backoffLimit times (default: 6).

| Aspect | Job | Deployment |
|---|---|---|
| Goal | Run to completion | Run indefinitely |
| Success metric | Number of successful completions | Number of ready replicas |
| Pod restart | On failure (up to backoffLimit) | Always (restartPolicy: Always) |
| After completion | Job and pods remain (or TTL cleanup) | N/A—pods run forever |
| Scaling | parallelism and completions | replicas |
| Use case | Batch processing, migrations | Services, APIs |
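For orientation before the full configuration, here's a minimal sketch of a Job that runs a single pod to completion (the name, image, and command are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-once   # illustrative name
spec:
  backoffLimit: 4          # Retry failed pods up to 4 times
  template:
    spec:
      restartPolicy: Never # Jobs require Never or OnFailure
      containers:
      - name: main
        image: busybox:1.36
        command: ['sh', '-c', 'echo "running once" && sleep 5']
```

Apply it and block until it finishes with kubectl wait --for=condition=complete job/hello-once --timeout=120s.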
Let's examine a comprehensive Job configuration and understand each component:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processor-20241208
  namespace: batch-jobs
  labels:
    app: data-processor
    batch-date: "2024-12-08"
spec:
  # === Completion Configuration ===
  completions: 10           # Total successful completions needed
  parallelism: 3            # Run 3 pods concurrently
  completionMode: Indexed   # Each pod gets a unique index 0-(completions-1)

  # === Failure Handling ===
  backoffLimit: 4           # Retry failed pods up to 4 times

  # === Per-Index Retry Policy (1.28+) ===
  # NOTE: in practice these per-index fields are used instead of backoffLimit
  # (Indexed mode only); both are shown here for reference.
  backoffLimitPerIndex: 2   # Per-index retry limit (Indexed mode)
  maxFailedIndexes: 3       # Fail job if this many indexes fail

  # === Timeouts ===
  activeDeadlineSeconds: 3600   # Job must complete in 1 hour

  # === Pod Failure Policy (1.26+) ===
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: processor
        operator: In
        values: [42]   # Exit code 42 means unrecoverable error
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget   # Ignore eviction failures

  # === TTL for Cleanup ===
  ttlSecondsAfterFinished: 86400   # Delete job 24 hours after completion

  # === Suspend (1.24+) ===
  suspend: false   # Set true to pause job

  # === Pod Template ===
  template:
    metadata:
      labels:
        app: data-processor
    spec:
      restartPolicy: Never   # Required for Jobs (or OnFailure)

      # === Init Container ===
      initContainers:
      - name: wait-for-dependencies
        image: busybox:1.36
        command: ['sh', '-c', 'until nc -z kafka.default 9092; do sleep 2; done']

      # === Main Container ===
      containers:
      - name: processor
        image: company/data-processor:v2.1.0
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "2"
            memory: "2Gi"
        env:
        # Pod index for Indexed completion mode
        - name: JOB_COMPLETION_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        - name: TOTAL_COMPLETIONS
          value: "10"
        - name: BATCH_DATE
          value: "2024-12-08"
        - name: DB_CONNECTION
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: connection-string
        volumeMounts:
        - name: data
          mountPath: /data
        - name: config
          mountPath: /etc/processor

      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: batch-data-pvc
      - name: config
        configMap:
          name: processor-config

      # === Affinity for batch nodes ===
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: workload-type
                operator: In
                values: ["batch"]
```

Use restartPolicy: OnFailure if you want the same pod to retry (vs. new pod creation with Never).

Jobs support multiple parallelism patterns for different batch processing needs. Understanding these patterns helps you choose the right approach for your workload.
Single Pod Execution
The simplest pattern: one pod runs to completion. If it fails, it's retried (up to backoffLimit).
```yaml
spec:
  completions: 1   # Default
  parallelism: 1   # Default
  # One pod runs, job completes when it succeeds
```

Use cases: Database migrations, one-off scripts, simple data exports.
Use Indexed when: you know the work segments upfront and each pod can derive its slice from its completion index. Use Work Queue when: work items are dynamic, pods pull items competitively from a shared queue, or you need fine-grained item-level retry. Use a plain Fixed Completion count when: you simply need N successful runs of equivalent, independent work. A sketch of the Indexed pattern follows.
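In this sketch of the Indexed pattern (the job name and echo command are illustrative stand-ins for real shard processing), each pod derives its work slice from the index Kubernetes injects:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-shards   # illustrative
spec:
  completions: 5          # 5 shards of work
  parallelism: 5          # Process all shards concurrently
  completionMode: Indexed # Pods receive indexes 0-4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36
        # In Indexed mode, the JOB_COMPLETION_INDEX env var is set automatically;
        # a real worker would map it to an input partition (e.g., shard-$INDEX files).
        command: ['sh', '-c', 'echo "processing shard $JOB_COMPLETION_INDEX of 5"']
```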
Robust batch processing requires sophisticated failure handling. Kubernetes 1.26+ introduces Pod Failure Policy for fine-grained control over how failures are handled.
Basic Failure Configuration:
```yaml
spec:
  # Basic retry configuration
  backoffLimit: 6              # Total retries before job fails

  # Time limit for entire job
  activeDeadlineSeconds: 7200  # 2 hours max

  # Pod-level restart policy
  template:
    spec:
      restartPolicy: Never     # Create new pod on failure
      # Alternative: OnFailure - restart the container in the same pod
```

Pod Failure Policy (1.26+):
Pod Failure Policy allows you to treat different failures differently—some should retry, some should fail the job immediately, and some should be ignored:
```yaml
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    # Rule 1: Unrecoverable application errors → fail immediately
    - action: FailJob
      onExitCodes:
        containerName: processor
        operator: In
        values:
        - 10   # Configuration error
        - 20   # Invalid input data
        - 42   # Unrecoverable corruption

    # Rule 2: Transient errors → count toward backoffLimit (retry)
    - action: Count
      onExitCodes:
        containerName: processor
        operator: In
        values:
        - 1    # Network timeout
        - 2    # Temporary unavailability

    # Rule 3: Pod disruption (preemption, eviction) → ignore (don't count as failure)
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget

    # Rule 4: OOM kills → fail job (indicates resource misconfiguration)
    - action: FailJob
      onExitCodes:
        containerName: processor
        operator: In
        values: [137]   # 128 + 9 (SIGKILL from OOM)
```

| Action | Behavior | Use Case |
|---|---|---|
| FailJob | Immediately fail the entire Job | Unrecoverable errors, configuration issues |
| Count | Count toward backoffLimit, retry | Transient failures, network issues |
| Ignore | Don't count as failure, retry | Evictions, preemptions, infrastructure issues |
Jobs may execute the same work multiple times due to retries, ambiguous pod status, or network partitions. Always design batch jobs to be idempotent—running the same work twice should produce the same result without side effects. Use transaction IDs, upserts, or external deduplication.
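As a sketch of the upsert approach, assuming a PostgreSQL target and a hypothetical results table with a unique constraint on batch_id, rerunning the same batch updates the existing row instead of inserting a duplicate:

```bash
# Hypothetical: BATCH_DATE deterministically identifies this run's work,
# so a retry or rerun of the same batch is safe.
psql "$DB_CONNECTION" <<SQL
INSERT INTO results (batch_id, record_count, total)
VALUES ('${BATCH_DATE}', 10000, 42.0)
ON CONFLICT (batch_id)   -- requires a UNIQUE constraint on batch_id
DO UPDATE SET record_count = EXCLUDED.record_count,
              total        = EXCLUDED.total;
SQL
```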
A CronJob creates Jobs on a schedule, using the same cron syntax familiar from Unix systems. It's the Kubernetes-native way to run periodic batch tasks.
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report-generator
  namespace: reports
spec:
  # === Schedule (Cron Syntax) ===
  schedule: "0 2 * * *"   # 2 AM daily

  # === Timezone (1.27+) ===
  timeZone: "America/New_York"   # Explicit timezone

  # === Concurrency Policy ===
  concurrencyPolicy: Forbid   # Don't start new if previous still running
  # Other options: Allow (default), Replace

  # === Deadline ===
  startingDeadlineSeconds: 300   # Must start within 5 min of scheduled time

  # === History Limits ===
  successfulJobsHistoryLimit: 3   # Keep last 3 successful jobs
  failedJobsHistoryLimit: 3       # Keep last 3 failed jobs

  # === Suspend ===
  suspend: false   # Set true to pause scheduling

  # === Job Template ===
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600
      backoffLimit: 2
      template:
        metadata:
          labels:
            app: report-generator
        spec:
          restartPolicy: OnFailure
          containers:
          - name: generator
            image: company/report-generator:v1.5
            resources:
              requests:
                cpu: "1"
                memory: "2Gi"
              limits:
                cpu: "4"
                memory: "8Gi"
            env:
            # NOTE: env values are passed literally; Kubernetes does not perform
            # shell substitution, so compute "yesterday" in the container entrypoint.
            - name: REPORT_DATE
              value: "$(date -d 'yesterday' +%Y-%m-%d)"
            - name: OUTPUT_BUCKET
              value: "s3://reports/daily/"
            volumeMounts:
            - name: credentials
              mountPath: /etc/credentials
          volumes:
          - name: credentials
            secret:
              secretName: report-credentials
```

Cron Schedule Syntax:
```
┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6) (Sunday = 0)
│ │ │ │ │
* * * * *
```
| Schedule | Meaning |
|---|---|
| 0 * * * * | Every hour at minute 0 |
| 0 0 * * * | Every day at midnight |
| 0 2 * * * | Every day at 2 AM |
| 0 0 * * 0 | Every Sunday at midnight |
| 0 0 1 * * | First day of every month at midnight |
| */15 * * * * | Every 15 minutes |
| 0 9-17 * * 1-5 | Every hour 9 AM-5 PM, Monday-Friday |
| 0 0 */2 * * | Every 2 days at midnight |
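One way to sanity-check a schedule expression before wiring it into a real workload is a throwaway CronJob (the name and image here are illustrative):

```bash
# Fire every minute, watch the Jobs appear, then clean up
kubectl create cronjob schedule-test --image=busybox:1.36 \
  --schedule="*/1 * * * *" -- sh -c 'date; echo schedule fired'
kubectl get jobs --watch
kubectl delete cronjob schedule-test
```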
Understanding concurrency policies and timing behavior is crucial for reliable scheduled jobs.
```yaml
# Pattern 1: Database cleanup - never overlap
spec:
  concurrencyPolicy: Forbid
  schedule: "0 3 * * *"           # 3 AM
  startingDeadlineSeconds: 1800   # 30 min window
  # If the 3 AM job hasn't finished by 4 AM, the 4 AM run is skipped
---
# Pattern 2: Cache refresh - only latest matters
spec:
  concurrencyPolicy: Replace
  schedule: "*/5 * * * *"   # Every 5 minutes
  # If the 00:05 job is still running at 00:10, kill it and start fresh
---
# Pattern 3: Independent processing - allow overlap
spec:
  concurrencyPolicy: Allow
  schedule: "0 * * * *"   # Hourly
  # Each hour's job processes that hour's data independently
```

If the CronJob controller misses more than 100 consecutive schedules (for example, because the controller was down for an extended period), it logs an error and stops scheduling that CronJob. This is a safety mechanism. After fixing the underlying issue, you may need to delete and recreate the CronJob or manually trigger a run.
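If a CronJob has stopped scheduling, these standard kubectl operations (the CronJob name is illustrative) cover the usual recovery paths:

```bash
# Pause and resume scheduling without deleting the CronJob
kubectl patch cronjob daily-report-generator -p '{"spec":{"suspend":true}}'
kubectl patch cronjob daily-report-generator -p '{"spec":{"suspend":false}}'

# Trigger an immediate one-off run from the CronJob's job template
kubectl create job --from=cronjob/daily-report-generator manual-run-1
```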
Let's explore battle-tested patterns for running reliable batch processing in production:
ETL Pipeline with Checkpointing
For data pipelines, implement checkpointing to enable resumable processing:
```yaml
# Daily ETL job that processes incrementally
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etl-pipeline
spec:
  schedule: "0 4 * * *"
  timeZone: "UTC"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 3
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: etl
            image: company/etl:v2.0
            env:
            - name: CHECKPOINT_TABLE
              value: "etl_checkpoints"
            - name: BATCH_SIZE
              value: "10000"
            command:
            - /bin/sh
            - -c
            - |
              # get_checkpoint, process_data, and update_checkpoint are
              # placeholder helpers assumed to ship inside the ETL image.
              # Read last checkpoint
              CHECKPOINT=$(get_checkpoint)
              # Process incrementally
              process_data --from=$CHECKPOINT --batch=$BATCH_SIZE
              # Update checkpoint on success
              update_checkpoint
```

Batch job failures can be tricky to debug because the pods terminate. Here's a systematic debugging approach:
```bash
# 1. Check Job status
kubectl describe job <job-name>
# Look for:
# - Conditions: Complete, Failed
# - Active/Succeeded/Failed pod counts
# - Events showing pod creation/failure

# 2. List pods from a Job (including completed)
kubectl get pods --selector=job-name=<job-name>

# 3. Check logs from a completed pod
kubectl logs <pod-name>
kubectl logs <pod-name> --previous   # If the container was restarted

# 4. Check CronJob status and schedule
kubectl describe cronjob <cronjob-name>
# Look for:
# - Last Schedule Time
# - Active Jobs (currently running)
# - Last Successful Time

# 5. List Jobs created by a CronJob (Job names are prefixed with the CronJob name)
kubectl get jobs | grep <cronjob-name>

# 6. Check for missed schedules via the CronJob's events
kubectl get events --field-selector involvedObject.name=<cronjob-name>

# 7. Debug stuck Jobs
kubectl get pods --selector=job-name=<job-name> -o wide
# Check node issues, pending state, etc.

# 8. Force trigger a CronJob (manual run)
kubectl create job --from=cronjob/<cronjob-name> <manual-job-name>
```

| Symptom | Likely Cause | Solution |
|---|---|---|
| Job stuck in Active | Pod hanging or stuck | Check pod status, logs, resource constraints |
| Job shows all pods Failed | backoffLimit reached | Check logs, fix issue, delete/recreate job |
| CronJob not triggering | Suspended or schedule syntax error | Check suspend field, validate cron syntax |
| CronJob runs missed | startingDeadlineSeconds too short | Increase deadline or fix controller issues |
| Multiple Jobs running | concurrencyPolicy: Allow | Change to Forbid if overlap is problematic |
| Job cleanup not working | No ttlSecondsAfterFinished | Set TTL or implement manual cleanup |
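For the last row, when no TTL is set, manual cleanup of completed Jobs can be scripted with a jsonpath filter (a sketch; the namespace is illustrative, and the delete fails harmlessly if nothing matches):

```bash
# Delete all Jobs in the namespace that have at least one successful completion
kubectl delete job -n batch-jobs \
  $(kubectl get jobs -n batch-jobs \
    -o jsonpath='{.items[?(@.status.succeeded==1)].metadata.name}')
```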
Let's consolidate the essential knowledge about Jobs and CronJobs:

- Jobs run pods to completion: completions sets how many successes you need, parallelism sets how many pods run at once, and completionMode: Indexed gives each pod a stable work index.
- Failure handling is layered: restartPolicy at the pod level, backoffLimit and activeDeadlineSeconds at the Job level, and podFailurePolicy (1.26+) for per-exit-code decisions.
- Design every batch job to be idempotent; retries and reruns are normal, not exceptional.
- CronJobs create Jobs on a cron schedule; concurrencyPolicy (Allow, Forbid, Replace) governs overlap and startingDeadlineSeconds bounds late starts.
- Clean up finished Jobs with ttlSecondsAfterFinished and CronJob history limits to avoid accumulating completed resources.
What's next:
Now that you've mastered all four major Kubernetes workload types—Deployments, StatefulSets, DaemonSets, and Jobs/CronJobs—we'll bring everything together in the final page: Choosing the Right Workload Type. You'll learn decision frameworks for matching your application's requirements with the appropriate Kubernetes abstraction.
You now have a comprehensive understanding of Jobs and CronJobs. You can design reliable batch processing with appropriate parallelism, failure handling, and scheduling. You understand concurrency policies, timing considerations, and production patterns for data pipelines and maintenance automation. Next, we'll synthesize all workload types into a decision framework.