Automated reconciliation is the heartbeat of GitOps. It's the continuous loop that compares the desired state in Git against the live state in the cluster, detects differences, and takes action to converge toward the desired state. Without automated reconciliation, GitOps is just "Git storage"—a repository of manifests that someone must manually apply.
The power of true reconciliation is self-healing infrastructure. If someone accidentally deletes a Deployment, it's recreated. If a ConfigMap is manually modified, it's reverted. If a new node joins the cluster, all required resources are applied. The cluster continuously converges toward the canonical state defined in Git.
This page dissects the reconciliation process in detail. You'll understand how drift is detected, the strategies for resolving drift, how health assessments work, how to handle failures gracefully, and the performance and operational considerations for running reconciliation at scale.
The reconciliation loop follows a consistent pattern regardless of which GitOps tool you use. Understanding these mechanics is essential for troubleshooting, optimizing, and designing effective GitOps workflows.
Phase-by-Phase Breakdown:
Phase 1: Source Acquisition
The operator fetches the latest source artifacts. Sources are cached to minimize network overhead, with cache invalidation based on commit SHA, chart version, or revision.
Phase 2: Manifest Generation
Raw sources are transformed into Kubernetes manifests, and variables such as ${CLUSTER_NAME} are substituted. This phase is critical for performance: rendering thousands of manifests can be CPU-intensive.
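The substitution step can be pictured with a minimal Python sketch. It is not how any particular GitOps tool implements rendering. The `render_manifest` helper, the ConfigMap contents, and the `staging-eu` value are all hypothetical, and the standard library's `string.Template` is used only because it happens to understand the `${NAME}` syntax.

```python
from string import Template


def render_manifest(raw: str, variables: dict[str, str]) -> str:
    """Replace ${VAR} placeholders; raises KeyError if a variable is undefined."""
    return Template(raw).substitute(variables)


# Hypothetical manifest with one placeholder.
raw_manifest = """\
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-info
data:
  cluster: ${CLUSTER_NAME}
"""

rendered = render_manifest(raw_manifest, {"CLUSTER_NAME": "staging-eu"})
print(rendered)
```

Failing loudly on an undefined variable (rather than leaving the placeholder in place) is a deliberate choice in this sketch: an unrendered `${...}` reaching the cluster is usually harder to debug than a failed render.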
Phase 3: State Comparison
The operator compares desired state against live state.
Phase 4: Apply Changes
Differences are applied to the cluster.
Phase 5: Health Assessment
After applying, the operator verifies convergence.
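The five phases can be sketched end to end as a toy loop over in-memory dictionaries. Everything here is an assumption made for illustration: the function names, the `(kind, name)` resource key, and the simplification that "healthy" means "live equals desired". Real operators talk to Git and the Kubernetes API and assess health from resource status.

```python
import copy


def fetch_source(repo: dict, cache: dict) -> dict:
    """Phase 1: return files at the current revision, cached by commit SHA."""
    revision = repo["revision"]
    if revision not in cache:
        cache[revision] = copy.deepcopy(repo["files"])
    return cache[revision]


def render(source: dict) -> dict:
    """Phase 2: turn source files into manifests keyed by (kind, name)."""
    return {(m["kind"], m["name"]): m for m in source.values()}


def diff(desired: dict, live: dict) -> tuple[dict, list]:
    """Phase 3: find manifests to apply and live resources absent from Git."""
    to_apply = {k: m for k, m in desired.items() if live.get(k) != m}
    to_prune = [k for k in live if k not in desired]
    return to_apply, to_prune


def apply_changes(live: dict, to_apply: dict, to_prune: list, prune: bool) -> None:
    """Phase 4: create or update resources, then optionally prune orphans."""
    for key, manifest in to_apply.items():
        live[key] = copy.deepcopy(manifest)
    if prune:
        for key in to_prune:
            del live[key]


def converged(desired: dict, live: dict) -> bool:
    """Phase 5 (simplified): every desired resource exists and matches."""
    return all(live.get(k) == m for k, m in desired.items())


def reconcile_once(repo: dict, live: dict, cache: dict, prune: bool = True) -> bool:
    desired = render(fetch_source(repo, cache))
    to_apply, to_prune = diff(desired, live)
    apply_changes(live, to_apply, to_prune, prune)
    return converged(desired, live)


# Hypothetical repository and cluster state.
repo = {"revision": "abc123", "files": {
    "deploy.yaml": {"kind": "Deployment", "name": "web", "replicas": 3},
    "svc.yaml": {"kind": "Service", "name": "web", "port": 80},
}}
live = {
    ("Deployment", "web"): {"kind": "Deployment", "name": "web", "replicas": 5},  # drifted
    ("ConfigMap", "old"): {"kind": "ConfigMap", "name": "old"},  # not in Git
}
cache: dict = {}
is_converged = reconcile_once(repo, live, cache)
print(is_converged, sorted(live))
```

In this run the drifted Deployment is reset to three replicas, the missing Service is created, and the orphaned ConfigMap is pruned because `prune` defaults to true.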
Modern GitOps tools support Kubernetes Server-Side Apply (SSA), which tracks field ownership. This prevents conflicts when multiple controllers manage different fields of the same resource (e.g., GitOps manages the pod template while HPA manages spec.replicas). SSA is essential for coexistence with other controllers.
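Field ownership can be illustrated with a deliberately simplified model. This is a sketch of the idea, not the Kubernetes implementation: the `Resource` class, the flat dotted field paths, and the manager names `gitops` and `hpa` are hypothetical, and real SSA also handles nested structures, lists, and removal of fields a manager stops applying.

```python
class FieldConflict(Exception):
    """Raised when a manager changes a field another manager owns."""


class Resource:
    def __init__(self) -> None:
        self.fields: dict[str, object] = {}    # field path -> value
        self.owners: dict[str, set[str]] = {}  # field path -> manager names

    def apply(self, manager: str, fields: dict, force: bool = False) -> None:
        # A conflict is a field owned by someone else that we try to change.
        conflicts = [
            path for path, value in fields.items()
            if self.owners.get(path, set()) - {manager}
            and self.fields.get(path) != value
        ]
        if conflicts and not force:
            raise FieldConflict(f"{manager} conflicts on {conflicts}")
        for path, value in fields.items():
            if path in conflicts:
                self.owners[path] = {manager}  # forced takeover
            else:
                self.owners.setdefault(path, set()).add(manager)
            self.fields[path] = value


deployment = Resource()
deployment.apply("gitops", {"spec.template.image": "web:1.4.2"})
deployment.apply("hpa", {"spec.replicas": 10})

# GitOps re-applies only the fields it declares; replicas is untouched.
deployment.apply("gitops", {"spec.template.image": "web:1.4.3"})

try:
    deployment.apply("gitops", {"spec.replicas": 3})
    conflict_detected = False
except FieldConflict:
    conflict_detected = True

print(deployment.fields, conflict_detected)
```

The point of the model is the last step: because the replica count is owned by another manager, a GitOps apply that tries to change it is rejected rather than silently overwriting the autoscaler's value.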
Drift occurs when the live cluster state diverges from the desired state in Git. Detecting and handling drift correctly is one of the most nuanced aspects of GitOps operations.
Sources of Drift:
| Drift Source | Example | Handling Strategy |
|---|---|---|
| Manual kubectl changes | kubectl edit deployment | Auto-revert (self-heal) |
| Emergency hotfixes | kubectl set image... | Alert, investigate, commit fix to Git if valid |
| Controller modifications | HPA changes replicas | Ignore specific fields |
| Webhook mutations | Admission controller adds labels | Ignore injected fields |
| Expired resources | Certificate renewed with new data | Exclude dynamic fields |
| Failed rollouts | Deployment stuck at partial rollout | Force sync or manual intervention |
| Orphaned resources | Resource exists but not in Git | Prune or investigate ownership |
The Diff Algorithm:
GitOps tools compare resources using structured diffs, not simple text comparison. The process:
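Server-generated fields (resourceVersion, uid, timestamps) are excluded from the comparison, and each resource is then classified as InSync, OutOfSync, or Unknown.
Configuring Ignore Rules: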
Both ArgoCD and Flux allow you to ignore specific fields during drift detection. This is essential for fields managed by other controllers:
```yaml
# ArgoCD: ignoreDifferences in Application spec
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  source:
    repoURL: https://github.com/company/infra.git
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  # Ignore specific fields during diff
  ignoreDifferences:
    # Ignore replicas for HPA-managed deployments
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
    # Ignore webhook-injected labels
    - group: ""
      kind: Service
      jsonPointers:
        - /metadata/labels/webhook-injected
    # Ignore all fields matching pattern
    - group: apps
      kind: Deployment
      managedFieldsManagers:
        - kube-controller-manager
      jqPathExpressions:
        - .spec.template.spec.containers[].resources
---
# Global ignore patterns (in argocd-cm ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.ignoreDifferences.all: |
    jsonPointers:
      - /metadata/annotations/kubectl.kubernetes.io~1last-applied-configuration
```

Aggressive self-healing can fight legitimate controllers. If HPA scales pods to 10 but Git says 3, strict self-heal would revert to 3, defeating autoscaling. The solution is precise ignore rules, not disabling self-heal entirely. Self-heal is a powerful safety net; configure exceptions judiciously.
How and when reconciliation occurs is controlled by sync policies. Different environments require different strategies—production might need manual approval, while development can auto-sync continuously.
```yaml
# ArgoCD Sync Policy Configuration
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-app
spec:
  source:
    repoURL: https://github.com/company/infra.git
    path: apps/my-app/production
  destination:
    server: https://kubernetes.default.svc
  # Comprehensive sync policy
  syncPolicy:
    # Automated sync (remove for manual approval)
    automated:
      prune: true        # Delete resources not in Git
      selfHeal: true     # Revert manual changes
      allowEmpty: false  # Prevent deletion of all resources
    # Sync options control behavior
    syncOptions:
      - CreateNamespace=true               # Create ns if missing
      - PrunePropagationPolicy=foreground  # Wait for cascade
      - PruneLast=true                     # Prune after apply
      - ApplyOutOfSyncOnly=true            # Only apply changed
      - ServerSideApply=true               # Use SSA for conflicts
      - RespectIgnoreDifferences=true
    # Retry policy for transient failures
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
---
# Sync Windows: Control when syncs are allowed
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  syncWindows:
    # Allow syncs only during business hours
    - kind: allow
      schedule: "0 9 * * 1-5"  # Mon-Fri 9AM
      duration: 10h
      applications:
        - "*"
    # Block all syncs during change freeze
    - kind: deny
      schedule: "0 0 20-31 12 *"  # Dec 20-31
      duration: 24h
      applications:
        - "*"
      manualSync: true  # Allow manual override
```

Environment-Specific Policies:
| Environment | Auto-Sync | Self-Heal | Prune | Interval |
|---|---|---|---|---|
| Development | Yes | Yes | Yes | 1m |
| Staging | Yes | Yes | Yes | 5m |
| Production | Conditional | Yes | Yes | 30m |
Production Auto-Sync Considerations:
Many teams disable auto-sync for production, requiring manual approval for each deployment. However, this undermines GitOps's value—if Git is the source of truth, why require additional approval?
A balanced approach:
Applying manifests isn't enough—you need to verify that resources are actually healthy. A Deployment might be created successfully but fail to schedule pods. A Service might exist but have no healthy endpoints. Health assessment closes this gap.
Health States:
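GitOps tools typically classify health into a small set of states. ArgoCD, for example, reports Healthy, Progressing, Degraded, Suspended, Missing, or Unknown.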
Waiting for Health:
Both Flux and ArgoCD can wait for resources to become healthy before proceeding. This is essential for dependency ordering—you don't want applications deployed before the database is ready.
```yaml
# Flux: Explicit health checks
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: infra
  path: ./apps/my-app
  prune: true
  # Wait for resources to be ready
  wait: true
  timeout: 10m
  # Explicit health checks (in addition to automatic)
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: my-app
      namespace: my-app
    - apiVersion: apps/v1
      kind: StatefulSet
      name: database
      namespace: my-app
    # Custom health for CRDs with status conditions
    - apiVersion: databases.example.com/v1
      kind: PostgreSQL
      name: my-postgres
      namespace: my-app
---
# Dependency: Wait for infra before apps
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  dependsOn:
    - name: infrastructure  # Must be healthy first
  # ... rest of spec
```

For custom resources (operators), health is derived from status conditions. Ensure your CRDs expose standard conditions (Ready, Available, Progressing) in their status. ArgoCD allows custom health check scripts in Lua for complex health logic.
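Deriving health from status conditions can be sketched in a few lines of Python. The mapping below is an assumption chosen for illustration, not the rule any specific tool applies: the `health_from_conditions` function is hypothetical, and the precedence (Ready or Available first, then Progressing) is one plausible ordering.

```python
def health_from_conditions(resource: dict) -> str:
    """Map Kubernetes-style status conditions to a coarse health state."""
    conditions = resource.get("status", {}).get("conditions", [])
    by_type = {c["type"]: c["status"] for c in conditions}

    if by_type.get("Ready") == "True" or by_type.get("Available") == "True":
        return "Healthy"
    if by_type.get("Progressing") == "True":
        return "Progressing"
    if by_type:
        return "Degraded"  # conditions are reported, none of them positive
    return "Unknown"       # nothing reported yet


# Hypothetical custom resource that is still starting up.
postgres = {
    "kind": "PostgreSQL",
    "status": {"conditions": [
        {"type": "Ready", "status": "False"},
        {"type": "Progressing", "status": "True"},
    ]},
}
print(health_from_conditions(postgres))  # Progressing
```

A resource that exposes no conditions at all falls through to `Unknown`, which is why the advice to publish standard conditions matters: without them, an operator waiting on health has nothing to evaluate.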
Reconciliation failures are inevitable. Networks fail, images don't exist, resource limits are exceeded, and configuration errors slip through code review. Robust failure handling is essential for operational stability.
| Failure Type | Symptoms | Mitigation |
|---|---|---|
| Git connection failure | Cannot pull source | Retry with backoff; alert after threshold |
| Invalid manifests | YAML parse errors | Validate in CI before merge; reject bad commits |
| Missing dependencies | CRD not found | Order syncs correctly; dependencies first |
| Image pull failure | ImagePullBackOff | Verify image exists; check registry credentials |
| Resource quota exceeded | Forbidden: exceeded quota | Adjust quotas or resource requests |
| Insufficient node resources | Pending pods | Scale cluster or reduce requests |
| Admission webhook rejection | Denied by policy | Fix policy violation in manifests |
| Stuck rollout | Deployment not progressing | Investigate pods; may need manual intervention |
| Degraded health | Partial availability | Scale up; investigate failing instances |
Retry Strategies:
GitOps operators implement retry with exponential backoff to handle transient failures:
```
Attempt 1: Wait 5s
Attempt 2: Wait 10s
Attempt 3: Wait 20s
Attempt 4: Wait 40s
Attempt 5: Wait 60s (capped)
After max retries: Alert, mark as failed
```
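The schedule above follows from a base delay of 5s, a factor of 2, and a 60s cap. A minimal Python sketch of that arithmetic and a retry wrapper follows. The function names are hypothetical, the `sleep` parameter exists only so the example can run without waiting, and treating the first call as an unretried initial attempt is an assumption about how the schedule is counted.

```python
import time


def backoff_delays(base: float = 5.0, factor: float = 2.0,
                   cap: float = 60.0, limit: int = 5) -> list[float]:
    """Delay before each retry: base * factor**n, capped."""
    return [min(base * factor ** attempt, cap) for attempt in range(limit)]


def retry(operation, delays, sleep=time.sleep):
    """Run operation; on failure wait and retry once per delay, then give up."""
    try:
        return operation()
    except Exception as error:
        last_error = error
    for delay in delays:
        sleep(delay)
        try:
            return operation()
        except Exception as error:
            last_error = error
    raise RuntimeError("reconciliation failed after retries") from last_error


delays = backoff_delays()
print(delays)  # [5.0, 10.0, 20.0, 40.0, 60.0]

# Simulated sync that fails twice with a transient error, then succeeds.
calls = {"count": 0}
waited: list[float] = []


def flaky_sync() -> str:
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ConnectionError("git fetch timed out")
    return "Synced"


result = retry(flaky_sync, delays, sleep=waited.append)  # record instead of sleeping
print(result, waited)  # Synced [5.0, 10.0]
```

The fifth delay would be 80s without the cap (5 × 2⁴), which is why the listed schedule ends at 60s rather than continuing to double.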
Partial Failure Handling:
When reconciliation partially succeeds:
Rollback Strategies:
GitOps rollback is straightforward—revert the Git commit:
```bash
# Identify the bad commit
git log --oneline
# Revert the change (HEAD is the latest commit; pass the bad commit's SHA instead if it is older)
git revert HEAD
git push
# GitOps operator automatically syncs the reverted state
```
For more targeted rollback, reset to a known-good tag:
```bash
git checkout tags/v1.4.2 -- apps/my-app/
git commit -m "Rollback my-app to v1.4.2"
git push
```
```yaml
# ArgoCD notification configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  # Slack notification template
  template.app-sync-failed: |
    message: |
      🚨 *Application Sync Failed*
      Application: {{.app.metadata.name}}
      Status: {{.app.status.sync.status}}
      Revision: {{.app.status.sync.revision}}
      Error: {{.app.status.operationState.message}}
    slack:
      attachments: |
        [{
          "color": "#FF0000",
          "title": "{{.app.metadata.name}}",
          "fields": [
            {"title": "Sync Status", "value": "{{.app.status.sync.status}}", "short": true},
            {"title": "Health", "value": "{{.app.status.health.status}}", "short": true}
          ]
        }]
  # Define triggers (a failed sync is reported in operationState.phase, not sync.status)
  trigger.on-sync-failed: |
    - send: [app-sync-failed]
      when: app.status.operationState != nil and app.status.operationState.phase in ['Error', 'Failed']
  # Slack service configuration
  service.slack: |
    token: $slack-token
---
# Apply trigger to applications
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  annotations:
    notifications.argoproj.io/subscribe.on-sync-failed.slack: "#platform-alerts"
```

Configure alerts for actionable events only. Alerting on every reconciliation creates noise that teams learn to ignore. Alert on failures, degraded health, and drift, not routine syncs. Use severity levels to distinguish urgent issues from informational events.
Reconciliation at scale—thousands of resources across hundreds of applications—requires attention to performance. Without optimization, reconciliation can become a bottleneck, causing delayed deployments and increased cluster load.
```yaml
# ArgoCD Application Controller tuning
# The controller ships as a StatefulSet in the standard install manifests.
# Environment variable names vary between releases; verify them against your version.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2
              memory: 4Gi
          env:
            # Concurrent sync operations
            - name: ARGOCD_APPLICATION_CONTROLLER_OPERATION_PROCESSORS
              value: "25"
            # Concurrent status processors
            - name: ARGOCD_APPLICATION_CONTROLLER_STATUS_PROCESSORS
              value: "50"
            # Git fetch parallelism
            - name: ARGOCD_APPLICATION_CONTROLLER_REPO_SERVER_PARALLELISM_LIMIT
              value: "10"
            # Reduce reconciliation frequency for development
            - name: ARGOCD_APPLICATION_CONTROLLER_HARD_RECONCILIATION_TIMEOUT
              value: "2h"  # How often to force full reconciliation
            # Self-heal timeout
            - name: ARGOCD_APPLICATION_CONTROLLER_SELF_HEAL_TIMEOUT_SECONDS
              value: "5"
```

Metrics and Observability:
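Both ArgoCD and Flux expose Prometheus metrics for monitoring reconciliation performance, for example argocd_app_sync_total and argocd_app_reconcile in ArgoCD and gotk_reconcile_duration_seconds in Flux.
Dashboards built on these metrics help identify where reconciliation is slow or failing.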
Shorter intervals (1 minute) mean faster drift correction but higher load. Longer intervals (30 minutes) reduce load but delay deployments. Use webhooks to trigger immediate reconciliation on Git push, allowing longer baseline intervals without sacrificing responsiveness.
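The load side of this trade-off is simple arithmetic, sketched here in Python. The fleet size of 500 applications is a hypothetical figure, and the sketch counts only interval-driven reconciliations, ignoring any triggered by webhooks.

```python
def reconciliations_per_hour(app_count: int, interval_minutes: float) -> float:
    """Interval-driven reconciliations per hour across all applications."""
    return app_count * 60 / interval_minutes


APPS = 500  # hypothetical fleet size

for interval in (1, 5, 30):
    per_hour = reconciliations_per_hour(APPS, interval)
    print(f"{interval:>2}m interval: {per_hour:,.0f} reconciliations/hour")
```

Under these assumptions, moving from a 1-minute to a 30-minute interval cuts baseline reconciliations from 30,000 to 1,000 per hour, which is the headroom that webhook-triggered syncs let you reclaim without slowing deployments.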
Automated reconciliation is the engine that turns your Git repository into a live, self-healing cluster. Let's consolidate the essential concepts:
What's Next:
With reconciliation mastered, the final page brings everything together with GitOps for Kubernetes—the comprehensive guide to implementing GitOps specifically for Kubernetes workloads, including practical patterns, best practices, and real-world implementation strategies.