GitOps - Learning Module

Automated Reconciliation

The Self-Healing Cluster

Automated reconciliation is the heartbeat of GitOps. It's the continuous loop that compares the desired state in Git against the live state in the cluster, detects differences, and takes action to converge toward the desired state. Without automated reconciliation, GitOps is just "Git storage"—a repository of manifests that someone must manually apply.

The power of true reconciliation is self-healing infrastructure. If someone accidentally deletes a Deployment, it's recreated. If a ConfigMap is manually modified, it's reverted. If a new cluster is bootstrapped from the same repository, all required resources are applied. The cluster continuously converges toward the canonical state defined in Git.

What You Will Learn

This page dissects the reconciliation process in detail. You'll understand how drift is detected, the strategies for resolving drift, how health assessments work, how to handle failures gracefully, and the performance and operational considerations for running reconciliation at scale.

The Reconciliation Loop Mechanics

The reconciliation loop follows a consistent pattern regardless of which GitOps tool you use. Understanding these mechanics is essential for troubleshooting, optimizing, and designing effective GitOps workflows.

Phase-by-Phase Breakdown:

Phase 1: Source Acquisition

The operator fetches the latest source artifacts:

Clones or pulls from Git repositories
Downloads Helm charts from registries
Fetches OCI artifacts from container registries
Retrieves files from S3-compatible buckets

Sources are cached to minimize network overhead, with cache invalidation based on commit SHA, chart version, or revision.
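
As a rough illustration of source acquisition, a Flux GitRepository object declares where to fetch from and how often to check for new commits. The object name and repository URL are reused from the examples on this page; the branch and the one-minute interval are assumptions made for this sketch.

```yaml
# Sketch only: branch and interval are assumed values
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: infra
  namespace: flux-system
spec:
  interval: 1m          # how often to check the remote for a new commit
  url: https://github.com/company/infra.git
  ref:
    branch: main        # assumed default branch
```

When the fetched commit SHA has not changed, the existing artifact is reused, which is the cache behavior described above.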

Phase 2: Manifest Generation

Raw sources are transformed into Kubernetes manifests:

Kustomize build processes overlays and patches
Helm templates charts with provided values
Variables like ${CLUSTER_NAME} are substituted
SOPS or sealed secrets are decrypted

This phase is critical for performance—rendering thousands of manifests can be CPU-intensive.
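
The manifest generation steps can be sketched with a Flux Kustomization that enables SOPS decryption and variable substitution. This is a minimal example under stated assumptions: the Secret name sops-age and the value given to CLUSTER_NAME are hypothetical.

```yaml
# Sketch only: the secret name and substituted value are hypothetical
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: infra
  path: ./apps/my-app        # directory passed to the Kustomize build
  prune: true
  decryption:
    provider: sops           # decrypt SOPS-encrypted files during the build
    secretRef:
      name: sops-age         # hypothetical Secret holding the decryption key
  postBuild:
    substitute:
      CLUSTER_NAME: staging-eu-1   # replaces ${CLUSTER_NAME} in rendered manifests
```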

Phase 3: State Comparison

The operator compares desired state against live state:

Queries the Kubernetes API for existing resources
Computes a structured diff between desired and live
Classifies changes: create, update, or delete
Applies configured ignore rules (e.g., ignore replica count for HPA-managed deployments)

Phase 4: Apply Changes

Differences are applied to the cluster:

Uses server-side apply for better conflict detection
Creates resources that don't exist
Updates resources that differ from desired state
Optionally deletes resources removed from Git (pruning)

Phase 5: Health Assessment

After applying, the operator verifies convergence:

Waits for Deployments to roll out
Checks custom health conditions
Records sync and health state in the status of its own custom resources (for example, Application or Kustomization)
Emits events and metrics

Server-Side Apply (SSA)

Modern GitOps tools support Kubernetes Server-Side Apply, which tracks field ownership (Flux applies this way by default; ArgoCD enables it with the ServerSideApply=true sync option). This prevents conflicts when multiple controllers manage different fields of the same resource (e.g., the GitOps tool manages the pod template while the HPA manages spec.replicas). SSA is essential for coexistence with other controllers.
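
Field ownership is recorded on the live object under metadata.managedFields. The excerpt below is a simplified illustration of how two owners of one Deployment might appear; the manager names are assumptions and vary by tool and version, and real entries list individual fields in more detail.

```yaml
# Trimmed, illustrative excerpt of a live Deployment (not a complete object)
metadata:
  name: my-app
  managedFields:
    # Fields declared in Git and applied by the GitOps controller via SSA
    - manager: argocd-controller        # assumed manager name
      operation: Apply
      apiVersion: apps/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:template:
            f:spec:
              f:containers: {}
    # Replica count written through the scale subresource by the autoscaler
    - manager: kube-controller-manager  # assumed manager name
      operation: Update
      subresource: scale
      apiVersion: apps/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:replicas: {}
```

Because each field has a recorded owner, an apply that omits spec.replicas leaves the autoscaler's value alone instead of overwriting it.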

Drift Detection: Understanding and Handling Divergence

Drift occurs when the live cluster state diverges from the desired state in Git. Detecting and handling drift correctly is one of the most nuanced aspects of GitOps operations.

Sources of Drift:

Common Drift Sources and Their Implications
Drift Source	Example	Handling Strategy
Manual kubectl changes	kubectl edit deployment	Auto-revert (self-heal)
Emergency hotfixes	kubectl set image...	Alert, investigate, commit fix to Git if valid
Controller modifications	HPA changes replicas	Ignore specific fields
Webhook mutations	Admission controller adds labels	Ignore injected fields
Expired resources	Certificate renewed with new data	Exclude dynamic fields
Failed rollouts	Deployment stuck at partial rollout	Force sync or manual intervention
Orphaned resources	Resource exists but not in Git	Prune or investigate ownership

The Diff Algorithm:

GitOps tools compare resources using structured diffs, not simple text comparison. The process:

Normalize both desired and live resources (remove server-generated fields like resourceVersion, uid, timestamps)
Compare field by field, respecting field ownership
Apply ignore rules to exclude known dynamic fields
Classify the result: InSync, OutOfSync, Unknown
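
To make the steps concrete, here is a hand-written pair of trimmed objects with comments showing how each difference would be treated. The image tags, uid, and timestamps are invented for illustration, and the example assumes an ignore rule on /spec/replicas is configured.

```yaml
# Desired state, rendered from Git (trimmed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.4.2
---
# Live state, as returned by the API server (trimmed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app
  uid: 00000000-0000-4000-8000-000000000001   # server-generated: removed by normalization
  resourceVersion: "48213"                    # server-generated: removed by normalization
  creationTimestamp: "2024-01-15T09:30:00Z"   # server-generated: removed by normalization
spec:
  replicas: 10          # differs, but excluded by the assumed ignore rule
  template:
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.4.3   # differs and is not ignored
status:
  readyReplicas: 10     # status is not part of the desired state
# Result under these assumptions: OutOfSync, caused by the image field alone
```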

Configuring Ignore Rules:

Both ArgoCD and Flux allow you to ignore specific fields during drift detection. This is essential for fields managed by other controllers:

ignore-differences.yaml
# ArgoCD: ignoreDifferences in Application spec
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  project: default  # Required field; "default" is the built-in project
  source:
    repoURL: https://github.com/company/infra.git
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  
  # Ignore specific fields during diff
  ignoreDifferences:
    # Ignore replicas for HPA-managed deployments
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
    
    # Ignore webhook-injected labels
    - group: ""
      kind: Service
      jsonPointers:
        - /metadata/labels/webhook-injected
    
    # Ignore fields owned by this field manager, plus paths matched by the jq expression
    - group: apps
      kind: Deployment
      managedFieldsManagers:
        - kube-controller-manager
      jqPathExpressions:
        - .spec.template.spec.containers[].resources
 
---
# Global ignore patterns (in argocd-cm ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.ignoreDifferences.all: |
    jsonPointers:
    - /metadata/annotations/kubectl.kubernetes.io~1last-applied-configuration

The Self-Heal Dilemma

Aggressive self-healing can fight legitimate controllers. If HPA scales pods to 10 but Git says 3, strict self-heal would revert to 3—defeating autoscaling. The solution is precise ignore rules, not disabling self-heal entirely. Self-heal is a powerful safety net; configure exceptions judiciously.
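
A complementary practice, alongside ignore rules, is to leave spec.replicas out of the manifest stored in Git so the GitOps tool never declares a value for it. The sketch below shows that shape with an HPA owning the replica count; the image, labels, and scaling thresholds are assumptions chosen for illustration.

```yaml
# Sketch only: image, labels, and thresholds are assumed values
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app
spec:
  # spec.replicas is intentionally omitted so Git never competes with the HPA
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.4.2
          resources:
            requests:
              cpu: 100m      # CPU utilization targets need a CPU request
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```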

Sync Strategies and Policies

How and when reconciliation occurs is controlled by sync policies. Different environments require different strategies—production might need manual approval, while development can auto-sync continuously.

Sync Configuration Options

•Auto-Sync — Automatically apply when Git changes are detected. Ideal for development and staging environments where rapid iteration is valued.
•Manual Sync — Changes are detected but not applied until explicitly triggered. Common for production to require human approval.
•Self-Heal — Automatically revert live changes that diverge from Git. The signature GitOps behavior for preventing drift.
•Prune — Delete resources that exist in the cluster but not in Git. Essential for true declarative management but requires caution.
•Replace — Force resource replacement instead of patch. Useful when immutable fields change but risks disruption.
•Retry — Automatically retry failed syncs with exponential backoff. Handles transient failures gracefully.

sync-policies.yaml
# ArgoCD Sync Policy Configuration
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-app
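spec:
  project: production       # AppProject defined below; its sync windows apply
  source:
    repoURL: https://github.com/company/infra.git
    path: apps/my-app/production
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app       # Namespace that CreateNamespace=true creates if missing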
  
  # Comprehensive sync policy
  syncPolicy:
    # Automated sync (remove for manual approval)
    automated:
      prune: true           # Delete resources not in Git
      selfHeal: true        # Revert manual changes
      allowEmpty: false     # Prevent deletion of all resources
    
    # Sync options control behavior
    syncOptions:
      - CreateNamespace=true        # Create ns if missing
      - PrunePropagationPolicy=foreground  # Wait for cascade
      - PruneLast=true              # Prune after apply
      - ApplyOutOfSyncOnly=true     # Only apply changed
      - ServerSideApply=true        # Use SSA for conflicts
      - RespectIgnoreDifferences=true
    
    # Retry policy for transient failures
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
 
---
# Sync Windows: Control when syncs are allowed
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  syncWindows:
    # Allow syncs only during business hours
    - kind: allow
      schedule: "0 9 * * 1-5"  # Mon-Fri 9AM
      duration: 10h
      applications:
        - "*"
    
    # Block all syncs during change freeze
    - kind: deny
      schedule: "0 0 20-31 12 *"  # Dec 20-31
      duration: 24h
      applications:
        - "*"
      manualSync: true  # Allow manual override

Environment-Specific Policies:

Environment	Auto-Sync	Self-Heal	Prune	Interval
Development	Yes	Yes	Yes	1m
Staging	Yes	Yes	Yes	5m
Production	Conditional	Yes	Yes	30m

Production Auto-Sync Considerations:

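Many teams disable auto-sync for production and require manual approval for each deployment. This weakens the GitOps model: if Git is the source of truth and changes are already reviewed there, a second approval at sync time duplicates that control and lets the cluster lag behind the repository.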

A balanced approach:

Require approval in Git via pull request reviews
Enable auto-sync from Git once merged
Use sync windows to control timing
Implement progressive delivery for gradual rollouts

Health Assessment and Readiness

Applying manifests isn't enough—you need to verify that resources are actually healthy. A Deployment might be created successfully but fail to schedule pods. A Service might exist but have no healthy endpoints. Health assessment closes this gap.

Built-in Health Checks

•Deployments — Healthy when desired replicas = ready replicas and rollout complete
•StatefulSets — Healthy when all replicas are ready and ordinal positions stable
•DaemonSets — Healthy when desired number scheduled = number ready
•Jobs — Healthy when completed successfully (or still running if not timed out)
•Services — Healthy once created; for type LoadBalancer, healthy when an external IP/hostname is assigned
•Ingresses — Healthy when Load Balancer IP/hostname assigned (for cloud LBs)
•PersistentVolumeClaims — Healthy when bound to a PV
•Custom Resources — Health derived from status conditions (if present)

Health States:

GitOps tools typically classify health into states:

Healthy — Resource is operating correctly
Progressing — Resource is transitioning (e.g., Deployment rolling out)
Degraded — Resource is partially functional (e.g., 2/3 replicas ready)
Suspended — Resource is intentionally paused
Missing — Expected resource doesn't exist
Unknown — Health cannot be determined

Waiting for Health:

Both Flux and ArgoCD can wait for resources to become healthy before proceeding. This is essential for dependency ordering—you don't want applications deployed before the database is ready.

health-checks.yaml
# Flux: Explicit health checks
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: infra
  path: ./apps/my-app
  prune: true
  
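  # wait: true would health-check every applied resource, but Flux then
  # ignores healthChecks, so it is left disabled in this example
  wait: false
  timeout: 10m
  
  # Explicit health checks: only the resources listed here gate readiness
  healthChecks: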
    - apiVersion: apps/v1
      kind: Deployment
      name: my-app
      namespace: my-app
    - apiVersion: apps/v1
      kind: StatefulSet  
      name: database
      namespace: my-app
    # Custom health for CRDs with status conditions
    - apiVersion: databases.example.com/v1
      kind: PostgreSQL
      name: my-postgres
      namespace: my-app
 
---
# Dependency: Wait for infra before apps
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  dependsOn:
    - name: infrastructure  # Must be healthy first
  # ... rest of spec
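
ArgoCD expresses the same ordering idea with sync waves: resources in a lower wave are applied and must become healthy before the next wave starts. The fragment below is a sketch showing only the metadata that matters; the wave numbers are assumptions and the specs are omitted.

```yaml
# Sketch only: specs omitted, wave numbers are assumed
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database
  namespace: my-app
  annotations:
    argocd.argoproj.io/sync-wave: "0"   # applied first
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app
  annotations:
    argocd.argoproj.io/sync-wave: "1"   # applied after wave 0 is healthy
```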

Custom Health for CRDs

For custom resources (operators), health is derived from status conditions. Ensure your CRDs expose standard conditions (Ready, Available, Progressing) in their status. ArgoCD allows custom health check scripts in Lua for complex health logic.
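
A Lua health check for ArgoCD might look like the following sketch. It targets the hypothetical PostgreSQL custom resource used earlier on this page and assumes that resource reports a Ready condition; mapping a non-true Ready condition to Degraded is also an assumption that a real operator may not match.

```yaml
# Sketch only: the CRD and its Ready condition are hypothetical
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Key format: resource.customizations.health.<group>_<kind>
  resource.customizations.health.databases.example.com_PostgreSQL: |
    hs = {}
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, condition in ipairs(obj.status.conditions) do
        if condition.type == "Ready" then
          if condition.status == "True" then
            hs.status = "Healthy"
          else
            hs.status = "Degraded"
          end
          hs.message = condition.message
          return hs
        end
      end
    end
    -- No Ready condition yet: treat the resource as still converging
    hs.status = "Progressing"
    hs.message = "Waiting for the Ready condition to be reported"
    return hs
```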

Failure Handling and Recovery

Reconciliation failures are inevitable. Networks fail, images don't exist, resource limits are exceeded, and configuration errors slip through code review. Robust failure handling is essential for operational stability.

Common Failure Modes and Mitigations
Failure Type	Symptoms	Mitigation
Git connection failure	Cannot pull source	Retry with backoff; alert after threshold
Invalid manifests	YAML parse errors	Validate in CI before merge; reject bad commits
Missing dependencies	CRD not found	Order syncs correctly; dependencies first
Image pull failure	ImagePullBackOff	Verify image exists; check registry credentials
Resource quota exceeded	Forbidden: exceeded quota	Adjust quotas or resource requests
Insufficient node resources	Pending pods	Scale cluster or reduce requests
Admission webhook rejection	Denied by policy	Fix policy violation in manifests
Stuck rollout	Deployment not progressing	Investigate pods; may need manual intervention
Degraded health	Partial availability	Scale up; investigate failing instances

Retry Strategies:

GitOps operators implement retry with exponential backoff to handle transient failures:

Attempt 1: Wait 5s
Attempt 2: Wait 10s
Attempt 3: Wait 20s
Attempt 4: Wait 40s
Attempt 5: Wait 60s (5s × 2^4 = 80s, capped at the 60s maximum used in this illustration)
After max retries: Alert, mark as failed

Partial Failure Handling:

When reconciliation partially succeeds:

Resources that applied successfully remain
Failed resources are retried on next reconciliation
Status reflects partial sync state
Notifications alert on partial failures

Rollback Strategies:

GitOps rollback is straightforward—revert the Git commit:

# Identify the bad commit
git log --oneline

# Revert that commit (replace <bad-commit-sha> with the hash from the log)
git revert <bad-commit-sha>
git push

# GitOps operator automatically syncs the reverted state

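For a more targeted rollback, restore a single application's directory from a known-good tag:

# Restore files under apps/my-app/ to their content at v1.4.2.
# Files added after the tag are left in place; delete them separately if needed.
git checkout tags/v1.4.2 -- apps/my-app/
git commit -m "Rollback my-app to v1.4.2"
git push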

failure-alerting.yaml
# ArgoCD notification configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  # Slack notification template
  template.app-sync-failed: |
    message: |
      🚨 *Application Sync Failed*
      Application: {{.app.metadata.name}}
      Status: {{.app.status.sync.status}}
      Revision: {{.app.status.sync.revision}}
      Error: {{.app.status.operationState.message}}
    slack:
      attachments: |
        [{
          "color": "#FF0000",
          "title": "{{.app.metadata.name}}",
          "fields": [
            {"title": "Sync Status", "value": "{{.app.status.sync.status}}", "short": true},
            {"title": "Health", "value": "{{.app.status.health.status}}", "short": true}
          ]
        }]
 
  # Define triggers
  trigger.on-sync-failed: |
    - send: [app-sync-failed]
      when: app.status.operationState != nil and app.status.operationState.phase in ['Error', 'Failed']
 
  # Slack service configuration
  service.slack: |
    token: $slack-token
    
---
# Apply trigger to applications
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  annotations:
    notifications.argoproj.io/subscribe.on-sync-failed.slack: "#platform-alerts"

Avoid Alert Fatigue

Configure alerts for actionable events only. Alerting on every reconciliation creates noise that teams learn to ignore. Alert on failures, degraded health, and drift—not routine syncs. Use severity levels to distinguish urgent issues from informational events.
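
In Flux, one way to limit notifications to failures is the eventSeverity filter on an Alert. The sketch below assumes a Secret named slack-webhook that holds the webhook address, and the API version shown may differ depending on the Flux release you run.

```yaml
# Sketch only: secret name is hypothetical; check the API version for your Flux release
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: platform-alerts
  secretRef:
    name: slack-webhook       # hypothetical Secret with an `address` key
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: reconcile-failures
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: error        # skip informational events from routine syncs
  eventSources:
    - kind: GitRepository
      name: "*"
    - kind: Kustomization
      name: "*"
```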

Performance and Scaling Considerations

Reconciliation at scale—thousands of resources across hundreds of applications—requires attention to performance. Without optimization, reconciliation can become a bottleneck, causing delayed deployments and increased cluster load.

Performance Optimization Strategies

•ApplyOutOfSyncOnly — Only apply resources that differ from live state, skipping unchanged resources. Dramatically reduces API server load.
•Source Caching — Cache Git clones and Helm charts. Only re-fetch when revision changes. Both tools implement this by default.
•Manifest Caching — Cache rendered manifests to avoid re-rendering on every reconciliation. Flux caches artifacts; ArgoCD's repo-server caches renders.
•Parallel Reconciliation — Process multiple applications/kustomizations concurrently. Configure controller concurrency settings.
•Selective Watching — Watch only namespaces or resources your applications care about, reducing controller memory usage.
•Resource Limits — Set appropriate CPU/memory limits on controllers. Under-provisioned controllers throttle and slow reconciliation.
•Shard Controllers — Run multiple controller instances, each managing a subset of applications. Essential for very large deployments.

performance-tuning.yaml
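# ArgoCD Application Controller tuning (partial manifest, intended as a patch)
apiVersion: apps/v1
kind: StatefulSet  # The controller ships as a StatefulSet in the standard install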
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2
              memory: 4Gi
          env:
            # Concurrent sync operations
            - name: ARGOCD_APPLICATION_CONTROLLER_OPERATION_PROCESSORS
              value: "25"
            
            # Concurrent status processors
            - name: ARGOCD_APPLICATION_CONTROLLER_STATUS_PROCESSORS
              value: "50"
            
            # Manifest generation parallelism is a repo-server setting
            # (ARGOCD_REPO_SERVER_PARALLELISM_LIMIT on argocd-repo-server),
            # so it is not configured on this controller.
            
            # How often to force a full (hard) reconciliation
            - name: ARGOCD_HARD_RECONCILIATION_TIMEOUT
              value: "2h"
            
            # Self-heal timeout
            - name: ARGOCD_APPLICATION_CONTROLLER_SELF_HEAL_TIMEOUT_SECONDS
              value: "5"
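
For Flux, the comparable knob is the --concurrent flag on each controller, usually added as a patch in the bootstrap kustomization.yaml. This is a sketch under assumptions: the file layout matches a standard flux bootstrap, and the value 20 is illustrative rather than a recommendation.

```yaml
# Sketch only: assumes the flux-system/kustomization.yaml created by flux bootstrap
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      # Append a flag to the first container's args
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --concurrent=20
    target:
      kind: Deployment
      name: "(kustomize-controller|helm-controller|source-controller)"
```

Raising concurrency increases CPU and memory use, so controller resource limits generally need to grow with it.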

Metrics and Observability:

Both ArgoCD and Flux expose Prometheus metrics for monitoring reconciliation performance:

argocd_app_reconcile — Histogram of the time taken to reconcile each application, in seconds
argocd_app_sync_total — Number of sync operations (success/failure)
gotk_reconcile_duration_seconds (Flux) — Reconciliation latency
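gotk_reconcile_condition (Flux) — Current condition of reconciled resources; deprecated since Flux 2.1 in favor of gotk_resource_info exported through kube-state-metrics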

Dashboards for these metrics help identify:

Slow applications that need manifest optimization
Controllers hitting resource limits
Source fetching bottlenecks
Drift detection anomalies
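
Beyond dashboards, the same metrics can drive alerts. The Prometheus rule file below is a minimal sketch that uses argocd_app_info, another metric exposed by ArgoCD that carries sync_status and health_status labels; the rule names, durations, and severities are assumptions.

```yaml
# Sketch only: rule names, durations, and severities are assumed
groups:
  - name: gitops-reconciliation
    rules:
      - alert: ArgoCDAppOutOfSync
        expr: argocd_app_info{sync_status="OutOfSync"} == 1
        for: 15m                 # ignore short-lived drift during normal syncs
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} has been OutOfSync for 15 minutes"
      - alert: ArgoCDAppDegraded
        expr: argocd_app_info{health_status="Degraded"} == 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.name }} has been Degraded for 10 minutes"
```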

Reconciliation Intervals Are Trade-offs

Shorter intervals (1 minute) mean faster drift correction but higher load. Longer intervals (30 minutes) reduce load but delay deployments. Use webhooks to trigger immediate reconciliation on Git push, allowing longer baseline intervals without sacrificing responsiveness.
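
With Flux, webhook-triggered reconciliation can be sketched with a Receiver that refreshes a GitRepository when GitHub reports a push. The Receiver name and the Secret name webhook-token are hypothetical, and the example assumes the repository is hosted on GitHub.

```yaml
# Sketch only: receiver and secret names are hypothetical
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: github-receiver
  namespace: flux-system
spec:
  type: github
  events:
    - "ping"
    - "push"
  secretRef:
    name: webhook-token        # hypothetical Secret holding the shared token
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      name: infra              # source to refresh when a push arrives
```

The Receiver reports a generated webhook path in its status. Exposing that endpoint to the Git provider is outside the scope of this sketch.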

Summary: Mastering Automated Reconciliation

Automated reconciliation is the engine that turns your Git repository into a live, self-healing cluster. Let's consolidate the essential concepts:

Reconciliation Key Takeaways

•The reconciliation loop is continuous — Source acquisition, manifest generation, state comparison, change application, and health assessment run perpetually.
•Drift detection requires configuration — Use ignore rules to prevent conflicts with HPA, admission webhooks, and other controllers.
•Sync policies vary by environment — Auto-sync for development, potentially gated for production, with self-heal enabled for drift correction.
•Health assessment ensures convergence — Don't just apply manifests; verify resources are healthy with explicit health checks and dependency ordering.
•Failure handling must be robust — Implement retry with backoff, alerting on failures, and clear rollback procedures.
•Performance requires attention at scale — Tune concurrency, caching, and resource limits as you scale beyond hundreds of applications.

What's Next:

With reconciliation mastered, the final page brings everything together with GitOps for Kubernetes—the comprehensive guide to implementing GitOps specifically for Kubernetes workloads, including practical patterns, best practices, and real-world implementation strategies.
