Not all workloads are meant to run forever. Some are designed to start, execute, and terminate—processing a batch of data, running a database migration, generating a report, or training a machine learning model.
These workloads don't fit the Deployment model (which tries to keep pods running indefinitely) or the StatefulSet model (which maintains persistent identity). They need a different abstraction: one that understands the concept of completion.
This is the domain of Jobs and CronJobs. Jobs run tasks to completion once. CronJobs run Jobs on a schedule, like cron in Unix systems but with Kubernetes' orchestration capabilities.
By the end of this page, you'll understand how to design reliable batch processing with Jobs—including parallelism, completion guarantees, and failure handling. You'll master CronJob scheduling for periodic tasks, understand timezone handling, and learn patterns for building robust data pipelines and maintenance automation.
A Kubernetes Job creates one or more pods and ensures that a specified number of them successfully terminate. Unlike Deployments that maintain a desired number of running pods indefinitely, Jobs track completions—successful terminations.
The Job lifecycle:
1. The Job controller creates pods, up to parallelism at a time.
2. Each pod runs to termination; a successful exit counts toward completions.
3. Failed pods are replaced (restartPolicy: Never) or have their containers restarted in place (OnFailure).
4. When enough pods have succeeded, the Job is marked Complete; if retries or the deadline are exhausted first, it is marked Failed.

Key Job characteristics:
- Success means pods terminating with exit code 0, not pods staying Ready.
- restartPolicy must be Never or OnFailure; Always is not allowed.
- Failed pods are retried up to backoffLimit times (default: 6).

| Aspect | Job | Deployment |
|---|---|---|
| Goal | Run to completion | Run indefinitely |
| Success metric | Number of successful completions | Number of ready replicas |
| Pod restart | On failure (up to backoffLimit) | Always (restartPolicy: Always) |
| After completion | Job and pods remain (or TTL cleanup) | N/A—pods run forever |
| Scaling | parallelism and completions | replicas |
| Use case | Batch processing, migrations | Services, APIs |
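For orientation before the full configuration, here's a minimal sketch of a Job that runs a single pod to completion (the name, image, and command are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-once   # illustrative name
spec:
  backoffLimit: 4          # Retry failed pods up to 4 times
  template:
    spec:
      restartPolicy: Never # Jobs require Never or OnFailure
      containers:
      - name: main
        image: busybox:1.36
        command: ['sh', '-c', 'echo "running once" && sleep 5']
```

Apply it and block until it finishes with kubectl wait --for=condition=complete job/hello-once --timeout=120s.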
Let's examine a comprehensive Job configuration and understand each component:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processor-20241208
  namespace: batch-jobs
  labels:
    app: data-processor
    batch-date: "2024-12-08"
spec:
  # === Completion Configuration ===
  completions: 10           # Total successful completions needed
  parallelism: 3            # Run 3 pods concurrently
  completionMode: Indexed   # Each pod gets a unique index 0-(completions-1)

  # === Failure Handling ===
  backoffLimit: 4           # Retry failed pods up to 4 times

  # === Per-Index Retry Policy (1.28+) ===
  # NOTE: in practice these per-index fields are used instead of backoffLimit
  # (Indexed mode only); both are shown here for reference.
  backoffLimitPerIndex: 2   # Per-index retry limit (Indexed mode)
  maxFailedIndexes: 3       # Fail job if this many indexes fail

  # === Timeouts ===
  activeDeadlineSeconds: 3600   # Job must complete in 1 hour

  # === Pod Failure Policy (1.26+) ===
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: processor
        operator: In
        values: [42]   # Exit code 42 means unrecoverable error
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget   # Ignore eviction failures

  # === TTL for Cleanup ===
  ttlSecondsAfterFinished: 86400   # Delete job 24 hours after completion

  # === Suspend (1.24+) ===
  suspend: false   # Set true to pause job

  # === Pod Template ===
  template:
    metadata:
      labels:
        app: data-processor
    spec:
      restartPolicy: Never   # Required for Jobs (or OnFailure)

      # === Init Container ===
      initContainers:
      - name: wait-for-dependencies
        image: busybox:1.36
        command: ['sh', '-c', 'until nc -z kafka.default 9092; do sleep 2; done']

      # === Main Container ===
      containers:
      - name: processor
        image: company/data-processor:v2.1.0
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "2"
            memory: "2Gi"
        env:
        # Pod index for Indexed completion mode
        - name: JOB_COMPLETION_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        - name: TOTAL_COMPLETIONS
          value: "10"
        - name: BATCH_DATE
          value: "2024-12-08"
        - name: DB_CONNECTION
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: connection-string
        volumeMounts:
        - name: data
          mountPath: /data
        - name: config
          mountPath: /etc/processor

      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: batch-data-pvc
      - name: config
        configMap:
          name: processor-config

      # === Affinity for batch nodes ===
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: workload-type
                operator: In
                values: ["batch"]
```

Use restartPolicy: OnFailure if you want the same pod to retry (vs. new pod creation with Never).

Jobs support multiple parallelism patterns for different batch processing needs. Understanding these patterns helps you choose the right approach for your workload.
Single Pod Execution
The simplest pattern: one pod runs to completion. If it fails, it's retried (up to backoffLimit).
```yaml
spec:
  completions: 1   # Default
  parallelism: 1   # Default
  # One pod runs, job completes when it succeeds
```

Use cases: Database migrations, one-off scripts, simple data exports.
Use Indexed when: you know the work segments upfront and each pod can derive its slice from its completion index. Use Work Queue when: work items are dynamic, pods pull items competitively from a shared queue, or you need fine-grained item-level retry. Use a plain Fixed Completion count when: you simply need N successful runs of equivalent, independent work. A sketch of the Indexed pattern follows.
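In this sketch of the Indexed pattern (the job name and echo command are illustrative stand-ins for real shard processing), each pod derives its work slice from the index Kubernetes injects:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-shards   # illustrative
spec:
  completions: 5          # 5 shards of work
  parallelism: 5          # Process all shards concurrently
  completionMode: Indexed # Pods receive indexes 0-4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36
        # In Indexed mode, the JOB_COMPLETION_INDEX env var is set automatically;
        # a real worker would map it to an input partition (e.g., shard-$INDEX files).
        command: ['sh', '-c', 'echo "processing shard $JOB_COMPLETION_INDEX of 5"']
```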
Robust batch processing requires sophisticated failure handling. Kubernetes 1.26+ introduces Pod Failure Policy for fine-grained control over how failures are handled.
Basic Failure Configuration:
```yaml
spec:
  # Basic retry configuration
  backoffLimit: 6              # Total retries before job fails

  # Time limit for entire job
  activeDeadlineSeconds: 7200  # 2 hours max

  # Pod-level restart policy
  template:
    spec:
      restartPolicy: Never     # Create new pod on failure
      # Alternative: OnFailure - restart the container in the same pod
```

Pod Failure Policy (1.26+):
Pod Failure Policy allows you to treat different failures differently—some should retry, some should fail the job immediately, and some should be ignored:
```yaml
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    # Rule 1: Unrecoverable application errors → fail immediately
    - action: FailJob
      onExitCodes:
        containerName: processor
        operator: In
        values:
        - 10   # Configuration error
        - 20   # Invalid input data
        - 42   # Unrecoverable corruption

    # Rule 2: Transient errors → count toward backoffLimit (retry)
    - action: Count
      onExitCodes:
        containerName: processor
        operator: In
        values:
        - 1    # Network timeout
        - 2    # Temporary unavailability

    # Rule 3: Pod disruption (preemption, eviction) → ignore (don't count as failure)
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget

    # Rule 4: OOM kills → fail job (indicates resource misconfiguration)
    - action: FailJob
      onExitCodes:
        containerName: processor
        operator: In
        values: [137]   # 128 + 9 (SIGKILL from OOM)
```

| Action | Behavior | Use Case |
|---|---|---|
| FailJob | Immediately fail the entire Job | Unrecoverable errors, configuration issues |
| Count | Count toward backoffLimit, retry | Transient failures, network issues |
| Ignore | Don't count as failure, retry | Evictions, preemptions, infrastructure issues |
Jobs may execute the same work multiple times due to retries, ambiguous pod status, or network partitions. Always design batch jobs to be idempotent—running the same work twice should produce the same result without side effects. Use transaction IDs, upserts, or external deduplication.
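As a sketch of the upsert approach, assuming a PostgreSQL target and a hypothetical results table with a unique constraint on batch_id, rerunning the same batch updates the existing row instead of inserting a duplicate:

```bash
# Hypothetical: BATCH_DATE deterministically identifies this run's work,
# so a retry or rerun of the same batch is safe.
psql "$DB_CONNECTION" <<SQL
INSERT INTO results (batch_id, record_count, total)
VALUES ('${BATCH_DATE}', 10000, 42.0)
ON CONFLICT (batch_id)   -- requires a UNIQUE constraint on batch_id
DO UPDATE SET record_count = EXCLUDED.record_count,
              total        = EXCLUDED.total;
SQL
```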
A CronJob creates Jobs on a schedule, using the same cron syntax familiar from Unix systems. It's the Kubernetes-native way to run periodic batch tasks.
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report-generator
  namespace: reports
spec:
  # === Schedule (Cron Syntax) ===
  schedule: "0 2 * * *"   # 2 AM daily

  # === Timezone (1.27+) ===
  timeZone: "America/New_York"   # Explicit timezone

  # === Concurrency Policy ===
  concurrencyPolicy: Forbid   # Don't start new if previous still running
  # Other options: Allow (default), Replace

  # === Deadline ===
  startingDeadlineSeconds: 300   # Must start within 5 min of scheduled time

  # === History Limits ===
  successfulJobsHistoryLimit: 3   # Keep last 3 successful jobs
  failedJobsHistoryLimit: 3       # Keep last 3 failed jobs

  # === Suspend ===
  suspend: false   # Set true to pause scheduling

  # === Job Template ===
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600
      backoffLimit: 2
      template:
        metadata:
          labels:
            app: report-generator
        spec:
          restartPolicy: OnFailure
          containers:
          - name: generator
            image: company/report-generator:v1.5
            resources:
              requests:
                cpu: "1"
                memory: "2Gi"
              limits:
                cpu: "4"
                memory: "8Gi"
            env:
            # NOTE: env values are passed literally; Kubernetes does not perform
            # shell substitution, so compute "yesterday" in the container entrypoint.
            - name: REPORT_DATE
              value: "$(date -d 'yesterday' +%Y-%m-%d)"
            - name: OUTPUT_BUCKET
              value: "s3://reports/daily/"
            volumeMounts:
            - name: credentials
              mountPath: /etc/credentials
          volumes:
          - name: credentials
            secret:
              secretName: report-credentials
```

Cron Schedule Syntax:
```
┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6) (Sunday = 0)
│ │ │ │ │
* * * * *
```
| Schedule | Meaning |
|---|---|
| 0 * * * * | Every hour at minute 0 |
| 0 0 * * * | Every day at midnight |
| 0 2 * * * | Every day at 2 AM |
| 0 0 * * 0 | Every Sunday at midnight |
| 0 0 1 * * | First day of every month at midnight |
| */15 * * * * | Every 15 minutes |
| 0 9-17 * * 1-5 | Every hour 9 AM-5 PM, Monday-Friday |
| 0 0 */2 * * | Every 2 days at midnight |
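One way to sanity-check a schedule expression before wiring it into a real workload is a throwaway CronJob (the name and image here are illustrative):

```bash
# Fire every minute, watch the Jobs appear, then clean up
kubectl create cronjob schedule-test --image=busybox:1.36 \
  --schedule="*/1 * * * *" -- sh -c 'date; echo schedule fired'
kubectl get jobs --watch
kubectl delete cronjob schedule-test
```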
Understanding concurrency policies and timing behavior is crucial for reliable scheduled jobs.
```yaml
# Pattern 1: Database cleanup - never overlap
spec:
  concurrencyPolicy: Forbid
  schedule: "0 3 * * *"           # 3 AM
  startingDeadlineSeconds: 1800   # 30 min window
  # If the 3 AM job hasn't finished by 4 AM, the 4 AM run is skipped
---
# Pattern 2: Cache refresh - only latest matters
spec:
  concurrencyPolicy: Replace
  schedule: "*/5 * * * *"   # Every 5 minutes
  # If the 00:05 job is still running at 00:10, kill it and start fresh
---
# Pattern 3: Independent processing - allow overlap
spec:
  concurrencyPolicy: Allow
  schedule: "0 * * * *"   # Hourly
  # Each hour's job processes that hour's data independently
```

If the CronJob controller misses more than 100 consecutive schedules (for example, because the controller was down for an extended period), it logs an error and stops scheduling that CronJob. This is a safety mechanism. After fixing the underlying issue, you may need to delete and recreate the CronJob or manually trigger a run.
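If a CronJob has stopped scheduling, these standard kubectl operations (the CronJob name is illustrative) cover the usual recovery paths:

```bash
# Pause and resume scheduling without deleting the CronJob
kubectl patch cronjob daily-report-generator -p '{"spec":{"suspend":true}}'
kubectl patch cronjob daily-report-generator -p '{"spec":{"suspend":false}}'

# Trigger an immediate one-off run from the CronJob's job template
kubectl create job --from=cronjob/daily-report-generator manual-run-1
```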
Let's explore battle-tested patterns for running reliable batch processing in production:
ETL Pipeline with Checkpointing
For data pipelines, implement checkpointing to enable resumable processing:
```yaml
# Daily ETL job that processes incrementally
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etl-pipeline
spec:
  schedule: "0 4 * * *"
  timeZone: "UTC"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 3
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: etl
            image: company/etl:v2.0
            env:
            - name: CHECKPOINT_TABLE
              value: "etl_checkpoints"
            - name: BATCH_SIZE
              value: "10000"
            command:
            - /bin/sh
            - -c
            - |
              # get_checkpoint, process_data, and update_checkpoint are
              # placeholder helpers assumed to ship inside the ETL image.
              # Read last checkpoint
              CHECKPOINT=$(get_checkpoint)
              # Process incrementally
              process_data --from=$CHECKPOINT --batch=$BATCH_SIZE
              # Update checkpoint on success
              update_checkpoint
```

Batch job failures can be tricky to debug because the pods terminate. Here's a systematic debugging approach:
```bash
# 1. Check Job status
kubectl describe job <job-name>
# Look for:
# - Conditions: Complete, Failed
# - Active/Succeeded/Failed pod counts
# - Events showing pod creation/failure

# 2. List pods from a Job (including completed)
kubectl get pods --selector=job-name=<job-name>

# 3. Check logs from a completed pod
kubectl logs <pod-name>
kubectl logs <pod-name> --previous   # If the container was restarted

# 4. Check CronJob status and schedule
kubectl describe cronjob <cronjob-name>
# Look for:
# - Last Schedule Time
# - Active Jobs (currently running)
# - Last Successful Time

# 5. List Jobs created by a CronJob (Job names are prefixed with the CronJob name)
kubectl get jobs | grep <cronjob-name>

# 6. Check for missed schedules via the CronJob's events
kubectl get events --field-selector involvedObject.name=<cronjob-name>

# 7. Debug stuck Jobs
kubectl get pods --selector=job-name=<job-name> -o wide
# Check node issues, pending state, etc.

# 8. Force trigger a CronJob (manual run)
kubectl create job --from=cronjob/<cronjob-name> <manual-job-name>
```

| Symptom | Likely Cause | Solution |
|---|---|---|
| Job stuck in Active | Pod hanging or stuck | Check pod status, logs, resource constraints |
| Job shows all pods Failed | backoffLimit reached | Check logs, fix issue, delete/recreate job |
| CronJob not triggering | Suspended or schedule syntax error | Check suspend field, validate cron syntax |
| CronJob runs missed | startingDeadlineSeconds too short | Increase deadline or fix controller issues |
| Multiple Jobs running | concurrencyPolicy: Allow | Change to Forbid if overlap is problematic |
| Job cleanup not working | No ttlSecondsAfterFinished | Set TTL or implement manual cleanup |
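For the last row, when no TTL is set, manual cleanup of completed Jobs can be scripted with a jsonpath filter (a sketch; the namespace is illustrative, and the delete fails harmlessly if nothing matches):

```bash
# Delete all Jobs in the namespace that have at least one successful completion
kubectl delete job -n batch-jobs \
  $(kubectl get jobs -n batch-jobs \
    -o jsonpath='{.items[?(@.status.succeeded==1)].metadata.name}')
```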
Let's consolidate the essential knowledge about Jobs and CronJobs:

- Jobs run pods to completion: completions sets how many successes you need, parallelism sets how many pods run at once, and completionMode: Indexed gives each pod a stable work index.
- Failure handling is layered: restartPolicy at the pod level, backoffLimit and activeDeadlineSeconds at the Job level, and podFailurePolicy (1.26+) for per-exit-code decisions.
- Design every batch job to be idempotent; retries and reruns are normal, not exceptional.
- CronJobs create Jobs on a cron schedule; concurrencyPolicy (Allow, Forbid, Replace) governs overlap and startingDeadlineSeconds bounds late starts.
- Clean up finished Jobs with ttlSecondsAfterFinished and CronJob history limits to avoid accumulating completed resources.
What's next:
Now that you've mastered all four major Kubernetes workload types—Deployments, StatefulSets, DaemonSets, and Jobs/CronJobs—we'll bring everything together in the final page: Choosing the Right Workload Type. You'll learn decision frameworks for matching your application's requirements with the appropriate Kubernetes abstraction.
You now have a comprehensive understanding of Jobs and CronJobs. You can design reliable batch processing with appropriate parallelism, failure handling, and scheduling. You understand concurrency policies, timing considerations, and production patterns for data pipelines and maintenance automation. Next, we'll synthesize all workload types into a decision framework.