GitOps with ArgoCD for SRE: Drift Detection, Sync Windows, and Kustomize That Actually Work in Prod

Your cluster is lying to you right now. Not maliciously — it just diverged from what your Git repo says it should be. Someone patched a ConfigMap directly with kubectl edit. A developer scaled a Deployment by hand and forgot to push the change. A Helm hook ran and nobody noticed the resource it left behind.

This is config drift, and it’s the silent killer of on-call peace. You build a system, document it in Git, and slowly — kubectl command by kubectl command — the cluster becomes an undocumented snowflake that only half the team understands.

ArgoCD solves this. Not partially. Fundamentally. But only if you actually understand what it’s doing and configure it properly for SRE work. This article is for teams that have moved past "ArgoCD is installed and apps deploy" and need to know how to make it production-grade: drift detection that alerts before it’s a problem, sync windows that protect your change freeze, and Kustomize overlays that don’t turn into a maintenance nightmare.

GitHub: https://github.com/argoproj/argo-cd


Why Drift Happens and Why You Should Care

When you’re operating Kubernetes at any meaningful scale, the gap between "what Git says" and "what’s running" is almost always non-zero. The usual culprits:

  • Emergency hotfixes applied directly to prod and never backported
  • Operators or controllers mutating resources (Karpenter, cert-manager, Istio sidecars)
  • Failed rollbacks that got stuck halfway
  • Someone using kubectl replace thinking it wouldn’t matter

From an SRE perspective, drift is technical debt that converts into incidents. When the cluster state is unknown, your runbooks are worthless, your disaster recovery assumptions are wrong, and your MTTR goes up every time something breaks.

ArgoCD makes the desired state explicit, observable, and enforceable. It doesn’t prevent all drift — some of it is intentional and fine — but it makes drift visible and actionable.


ArgoCD Architecture in 30 Seconds

ArgoCD runs inside your cluster. Its core loop is simple:

  1. Pull manifests from a Git repo (or Helm chart, or Kustomize directory)
  2. Compare them against live cluster state
  3. Report differences (drift) or automatically reconcile them

The key components you’ll interact with as an SRE:

  • Application — the CR that ties a Git source to a cluster destination
  • AppProject — RBAC and policy boundary (which repos, which clusters, which namespaces)
  • ApplicationSet — generates multiple Applications from a template (invaluable for multi-cluster)
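
To make the ApplicationSet idea concrete, here is a minimal sketch using the list generator. The element keys (env, namespace, project) and the staging project name are assumptions for illustration; the repo URL and overlay paths mirror the layout used later in this article.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-api
  namespace: argocd
spec:
  generators:
    # One Application is rendered per element in this list
    - list:
        elements:
          - env: staging
            namespace: payments-staging   # hypothetical namespace
            project: staging              # hypothetical AppProject
          - env: prod
            namespace: payments
            project: production
  template:
    metadata:
      name: 'payments-api-{{env}}'
    spec:
      project: '{{project}}'
      source:
        repoURL: https://github.com/your-org/k8s-manifests
        targetRevision: main
        path: 'apps/payments-api/overlays/{{env}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{namespace}}'
```

For multi-cluster setups the cluster generator is usually a better fit than a hand-maintained list, since it renders one Application per cluster registered in ArgoCD.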

Don’t run ArgoCD without RBAC configured. The default admin account with no project boundaries will cause problems on any team larger than two people.
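
A minimal sketch of what "RBAC configured" can mean in practice, assuming SSO is already in place. The group names (your-org:team-leads, your-org:platform) and the team-lead role are hypothetical; policy.default, policy.csv, and admin.enabled are the standard keys in argocd-rbac-cm and argocd-cm.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  # Anyone who authenticates is read-only unless a rule below says otherwise
  policy.default: role:readonly
  # p = permission: team leads may sync apps in the production project
  # g = grouping: map SSO groups (hypothetical names) to roles
  policy.csv: |
    p, role:team-lead, applications, sync, production/*, allow
    g, your-org:team-leads, role:team-lead
    g, your-org:platform, role:admin
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Turn off the built-in admin login once SSO works
  admin.enabled: "false"
```

Disable the admin account only after you have confirmed that an SSO user actually lands in role:admin, otherwise you lock yourself out.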


Installing ArgoCD the Right Way

Skip the kubectl apply -f install.yaml from the docs if you’re going to production. You want to manage ArgoCD itself with ArgoCD (app of apps pattern), and you want the install to be reproducible.

Kustomize-based install:

argocd/
├── kustomization.yaml
├── namespace.yaml
└── patches/
    ├── argocd-cmd-params-cm.yaml
    └── argocd-server-deployment.yaml

kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: argocd
resources:
  - namespace.yaml
  # Pin to a specific version — never use latest in prod
  - https://raw.githubusercontent.com/argoproj/argo-cd/v2.13.3/manifests/install.yaml

patches:
  - path: patches/argocd-cmd-params-cm.yaml
  - path: patches/argocd-server-deployment.yaml

patches/argocd-cmd-params-cm.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Disable insecure mode — always terminate TLS properly
  server.insecure: "false"
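  # Cap concurrent manifest generations if many apps hammer the repo-server
  # (the value is an example, tune it for your load)
  reposerver.parallelism.limit: "10"
  # Not set here: the Git polling interval (timeout.reconciliation) and
  # application.instanceLabelKey both live in the argocd-cm ConfigMap.
  # Metrics endpoints are served by default and need no flag.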

Gotcha: Many tutorials start argocd-server with --insecure so it serves plain HTTP behind a TLS-terminating proxy. Don’t do this in prod. Keep TLS on all the way to the server and put a proper ingress with cert-manager in front of it, using either TLS passthrough or an HTTPS backend.
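
One way to wire that up, as a sketch only: an Ingress that gets its certificate from cert-manager and talks HTTPS to argocd-server. This assumes the ingress-nginx controller and a ClusterIssuer named letsencrypt-prod; the hostname and the secret name are placeholders.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-server
  namespace: argocd
  annotations:
    # Assumes a cert-manager ClusterIssuer with this name exists
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # argocd-server still serves TLS, so the ingress must speak HTTPS to it
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: argocd.example.com          # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: argocd-server
                port:
                  name: https
  tls:
    - hosts:
        - argocd.example.com
      secretName: argocd-ingress-tls    # placeholder, populated by cert-manager
```

The argocd CLI uses gRPC, which some ingress setups handle poorly over a shared host. If CLI logins fail, try the --grpc-web flag before changing the ingress.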


Drift Detection: Understanding What ArgoCD Actually Compares

ArgoCD uses a three-way diff: desired state (Git), live state (cluster), and last applied state. This is the same approach as kubectl apply, but ArgoCD makes the result persistent and observable.

The sync status you’ll see:

  • Synced — live state matches desired state
  • OutOfSync — there’s a diff; ArgoCD won’t necessarily fix it automatically
  • Unknown — ArgoCD can’t reach the cluster or repo

The health status is separate:

  • Healthy — workloads are running and ready
  • Degraded — something is broken at the resource level
  • Progressing — rollout in flight
  • Suspended — the resource itself is suspended or paused (a paused Deployment rollout, a suspended CronJob)

These two axes tell different stories. An app can be Synced but Degraded (cluster matches Git, but the Deployment is crashing). It can also be OutOfSync but Healthy (someone manually scaled a Deployment up — cluster is fine, but it doesn’t match Git).

Configuring Drift Detection Properly

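By default, ArgoCD polls Git and re-compares every 3 minutes (timeout.reconciliation in argocd-cm). The interval matters less than what happens when a diff is found, and for SRE use you want to be deliberate about that: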

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/your-org/k8s-manifests
    targetRevision: main
    path: apps/payments-api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    # The presence of this block enables automated sync of new commits.
    # Remove it entirely if every sync should be triggered by a human.
    automated:
      prune: false        # Never delete resources automatically in prod
      selfHeal: false     # Live drift is reported but not reverted automatically
    syncOptions:
      - CreateNamespace=false   # Namespaces should be pre-created explicitly
      - PrunePropagationPolicy=foreground
      - RespectIgnoreDifferences=true
  ignoreDifferences:
    # Ignore fields mutated by controllers you trust
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas    # HPA manages this, not Git
    - group: ""
      kind: Secret
      name: payments-api-tls
      jsonPointers:
        - /data             # cert-manager manages cert data

The ignoreDifferences block is critical. Without it, every replica change the HPA makes shows up as drift (and with selfHeal on, ArgoCD keeps resetting it), and any cert-manager-populated Secret you also track in Git will show as OutOfSync forever. You’ll get alert fatigue and stop trusting your drift notifications.
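
JSON pointers are not the only way to express these rules. The fragment below is a sketch of two alternatives that ArgoCD also supports: jqPathExpressions for selecting list entries by content, and managedFieldsManagers for ignoring whatever a given field manager owns. The container name injected-sidecar is hypothetical.

```yaml
# Fragment of an Application spec, not a complete manifest
spec:
  ignoreDifferences:
    # Ignore a container that a mutating webhook adds to the Deployment
    - group: apps
      kind: Deployment
      jqPathExpressions:
        - '.spec.template.spec.containers[] | select(.name == "injected-sidecar")'
    # Ignore every field owned by this manager in metadata.managedFields
    - group: apps
      kind: Deployment
      managedFieldsManagers:
        - kube-controller-manager
```

Keep each rule as narrow as you can. A broad ignore rule hides real drift just as effectively as it hides noise.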

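Gotcha: selfHeal: true in production without sync windows and proper testing is dangerous. Self-heal reverts any live change back to what Git says, including the emergency kubectl patch someone applied mid-incident. And with automated sync enabled at all, a bad commit in Git gets pushed to prod with no human in the loop. Start with selfHeal: false and add it only after your team has confidence in the pipeline.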

Making Drift Visible

ArgoCD exposes Prometheus metrics. The ones that matter for SRE alerting:

# Prometheus alerting rules for ArgoCD
groups:
  - name: argocd-drift
    rules:
      - alert: ArgoCDAppOutOfSync
        expr: |
          argocd_app_info{sync_status="OutOfSync"} == 1
        for: 15m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "App {{ $labels.name }} has been OutOfSync for 15+ minutes"
          description: |
            Application {{ $labels.name }} in project {{ $labels.project }}
            is OutOfSync. Check ArgoCD UI or run:
            argocd app diff {{ $labels.name }}

      - alert: ArgoCDAppDegraded
        expr: |
          argocd_app_info{health_status="Degraded"} == 1
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "App {{ $labels.name }} is Degraded"

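      - alert: ArgoCDSyncFailed
        # Fires when a sync operation ended in Error or Failed recently
        expr: |
          sum by (name, project) (increase(argocd_app_sync_total{phase=~"Error|Failed"}[10m])) > 0
        labels:
          severity: warning
        annotations:
          summary: "A sync of app {{ $labels.name }} failed in the last 10 minutes"

      # Unknown is not a failed sync: ArgoCD could not compare at all
      - alert: ArgoCDAppSyncStatusUnknown
        expr: |
          argocd_app_info{sync_status="Unknown"} == 1
        for: 10m
        labels:
          severity: warning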

15 minutes before alerting on OutOfSync gives ArgoCD time to finish a sync cycle and avoids noisy alerts from short transient states. Adjust based on your reconciliation interval.
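
For reference, the reconciliation interval itself is set in argocd-cm. A minimal sketch, shown with the default value; the ArgoCD components may need a restart before a changed value takes effect.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # How often ArgoCD polls Git and re-compares (default: 180s)
  timeout.reconciliation: 180s
```

If you raise this to reduce load on your Git host, raise the for: duration on the OutOfSync alert with it, so a single slow cycle doesn't page anyone.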


Sync Windows: Protecting Your Change Freeze

Sync windows are AppProject-level gates that control when syncs can happen. This is exactly what you need for:

  • Change freezes before major releases
  • Protecting prod during business-critical hours
  • Enforcing a release schedule

Sync windows live on the AppProject, not the Application. This is intentional — it’s a policy concern, not a per-app configuration.

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: Production workloads
  sourceRepos:
    - 'https://github.com/your-org/k8s-manifests'
  destinations:
    - namespace: '*'
      server: https://kubernetes.default.svc
  clusterResourceWhitelist:
    - group: '*'
      kind: '*'
  syncWindows:
    # Allow deployments only during business hours on weekdays
    - kind: allow
      schedule: "0 8 * * 1-5"   # 08:00 Mon-Fri
      duration: 10h              # window closes at 18:00
      applications:
        - '*'
      manualSync: true           # Manual syncs are also permitted outside this window
      timeZone: "Europe/Tallinn"

    # Round-the-clock allow window for critical apps (an active deny window still wins)
    - kind: allow
      schedule: "0 0 * * *"
      duration: 24h
      applications:
        - "payments-api"
        - "auth-service"
      namespaces:
        - payments
        - auth
      manualSync: true

    # Block everything during release freeze
    - kind: deny
      schedule: "0 0 * * 5"    # 00:00 Friday, i.e. the start of Friday
      duration: 72h             # Blocks Friday through Sunday, ends 00:00 Monday
      applications:
        - '*'
      manualSync: false         # No one bypasses this, not even manually
      timeZone: "Europe/Tallinn"

The deny window overrides allow windows. This is the correct priority order for a change freeze — you don’t want someone accidentally scheduling a deploy that slips through because an allow window overlaps with the deny window.

Gotcha: manualSync: false on a deny window means ArgoCD also blocks manual syncs triggered from the UI or CLI. If you have an incident during the freeze and need to deploy a fix, someone with rights on the project has to either delete the window temporarily or enable manual sync on it with argocd proj windows enable-manual-sync. Plan your incident response procedure around this before you implement it.

Gotcha: Sync windows gate sync operations, not drift detection. Automated syncs, self-heal included, are held back while a window blocks them, but the app still flips to OutOfSync and your alerts still fire. Expect the backlog of pending changes to apply as soon as the window opens, and verify on your ArgoCD version how self-heal behaves during a freeze before you rely on it.

Checking Window Status from CLI

# See current window state for an app
argocd app get payments-api --show-operation

# Check which windows are active right now
argocd proj windows list production

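# Break-glass: permit manual syncs on a blocking window, then sync.
# WINDOW_ID is the ID column from the list command above.
argocd proj windows enable-manual-sync production "$WINDOW_ID"
argocd app sync payments-api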

Kustomize: The Right Way to Manage Multi-Env Configs

Kustomize is built into ArgoCD — no extra tooling needed. But how you structure your overlays determines whether Kustomize stays maintainable or becomes a layered mess of patches on patches.

The structure that works at scale:

k8s-manifests/
├── apps/
│   └── payments-api/
│       ├── base/
│       │   ├── kustomization.yaml
│       │   ├── deployment.yaml
│       │   ├── service.yaml
│       │   └── configmap.yaml
│       └── overlays/
│           ├── staging/
│           │   ├── kustomization.yaml
│           │   └── patches/
│           │       └── deployment-resources.yaml
│           └── prod/
│               ├── kustomization.yaml
│               └── patches/
│                   ├── deployment-resources.yaml
│                   └── deployment-replicas.yaml

apps/payments-api/base/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - deployment.yaml
  - service.yaml
  - configmap.yaml

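# Note: commonLabels also rewrites selectors (immutable on Deployments) and is
# deprecated in Kustomize v5 in favor of the labels field.
commonLabels: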
  app.kubernetes.io/name: payments-api
  app.kubernetes.io/managed-by: argocd

apps/payments-api/base/deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  # Base replica count — overlays will patch this for prod
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: payments-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: payments-api
    spec:
      containers:
        - name: payments-api
          image: your-registry/payments-api:latest  # image tag injected by CI
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          env:
            - name: ENV
              valueFrom:
                configMapKeyRef:
                  name: payments-api-config
                  key: ENV

apps/payments-api/overlays/prod/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: payments

resources:
  - ../../base

# Replace image tag with specific prod version
images:
  - name: your-registry/payments-api
    newTag: "v1.42.0"  # CI updates this field via kustomize edit set image

patches:
  - path: patches/deployment-resources.yaml
  - path: patches/deployment-replicas.yaml

# Prod-specific config values
configMapGenerator:
  - name: payments-api-config
    behavior: merge
    literals:
      - ENV=production
      - LOG_LEVEL=info
      - DB_HOST=payments-db.prod.svc.cluster.local

apps/payments-api/overlays/prod/patches/deployment-resources.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  template:
    spec:
      containers:
        - name: payments-api
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 2000m
              memory: 1Gi

apps/payments-api/overlays/prod/patches/deployment-replicas.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3

Updating Image Tags in CI

A reliable way to update image tags without hand-editing manifests:

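#!/bin/bash
# ci-update-image.sh: runs in your CI pipeline after image push

set -euo pipefail

APP="${1:?usage: ci-update-image.sh APP ENV NEW_TAG}"   # e.g., payments-api
ENV="${2:?missing ENV}"                                  # e.g., prod
NEW_TAG="${3:?missing NEW_TAG}"                          # e.g., v1.43.0

OVERLAY_PATH="apps/${APP}/overlays/${ENV}"
IMAGE_NAME="your-registry/${APP}"

cd k8s-manifests
git config user.email "ci-bot@example.com"   # placeholder address
git config user.name "CI Bot"

# kustomize edit works on the kustomization.yaml in the current directory,
# so run it from inside the overlay (in a subshell to keep our cwd)
(
  cd "${OVERLAY_PATH}"
  kustomize edit set image "${IMAGE_NAME}=${IMAGE_NAME}:${NEW_TAG}"
)

git add "${OVERLAY_PATH}/kustomization.yaml"

# Nothing to commit if the overlay is already at this tag
if git diff --cached --quiet; then
  echo "No change: ${APP} ${ENV} is already at ${NEW_TAG}"
  exit 0
fi

git commit -m "chore(${APP}): bump ${ENV} image to ${NEW_TAG} [ci skip]"
git push origin main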

ArgoCD picks up the commit and syncs based on your sync policy and windows. No webhooks needed — ArgoCD polls. If you need faster propagation, configure a webhook from your Git provider to ArgoCD’s /api/webhook endpoint.
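
If you do add a webhook, ArgoCD validates incoming payloads against a shared secret stored in argocd-secret. A sketch for GitHub, written as a patch to the existing Secret; the value is a placeholder and should come from your secret manager, not from Git.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
type: Opaque
stringData:
  # Must match the secret configured on the webhook in your GitHub repo settings
  webhook.github.secret: REPLACE_WITH_SHARED_SECRET   # placeholder
```

Polling stays active as a fallback, so a missed webhook delivery delays a sync by at most one reconciliation interval.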

Gotcha: The [ci skip] in the commit message prevents CI from triggering again on its own commit if your pipeline triggers on pushes to main. Without this, you get a loop.

Gotcha: Don’t use configMapGenerator with behavior: replace in overlays unless you know what you’re doing. Replace drops all base keys and only keeps overlay keys. merge is almost always what you want. I’ve seen this cause production outages when someone thought they were adding a key but actually wiped the config.
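
To see the difference, here is a small sketch. It assumes a variant of the base where the ConfigMap is produced by a generator, and the base keys are made up for illustration. The two kustomization files are shown in one block, separated by a document marker.

```yaml
# base/kustomization.yaml (hypothetical variant with a generated ConfigMap)
configMapGenerator:
  - name: payments-api-config
    literals:
      - ENV=base
      - LOG_LEVEL=debug
      - FEATURE_FLAGS=none
---
# overlays/prod/kustomization.yaml
configMapGenerator:
  - name: payments-api-config
    # merge:   ENV=production, LOG_LEVEL=debug, FEATURE_FLAGS=none
    # replace: ENV=production only, the other two keys are gone
    behavior: merge
    literals:
      - ENV=production
```

Running kustomize build on the overlay and diffing the output before and after a change is the cheapest way to catch a dropped key in review.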


The App of Apps Pattern for SRE Teams

Instead of creating Applications one by one in the ArgoCD UI, define them in Git too. One root Application that manages all other Applications:

# Root app — the only one you create manually
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-apps
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: platform
  source:
    repoURL: https://github.com/your-org/k8s-manifests
    targetRevision: main
    path: argocd/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

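Every file in argocd/apps/ is an ArgoCD Application manifest. Adding a new app to the cluster is a Git commit. Removing one is also a Git commit: with prune: true the root app deletes the child Application, and if that child carries the resources-finalizer.argocd.argoproj.io finalizer, ArgoCD also deletes the resources it managed. Without the finalizer on the child, its workloads are left running.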
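
A child file under argocd/apps/ might look like the sketch below. The file name is hypothetical; the source and destination reuse the payments-api values from earlier in this article.

```yaml
# argocd/apps/payments-api.yaml (hypothetical file name)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
  finalizers:
    # Cascade: deleting this Application also deletes the resources it manages
    - resources-finalizer.argocd.argoproj.io
spec:
  project: production
  source:
    repoURL: https://github.com/your-org/k8s-manifests
    targetRevision: main
    path: apps/payments-api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
```

Whether you want the finalizer on a given child is a judgment call. For stateful workloads, leaving it off means an accidental deletion of the Application file does not take the workload down with it.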

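This pattern also makes your disaster recovery story much cleaner. Rebuild the cluster, install ArgoCD, apply the root app, wait for the apps to sync, and the cluster converges on the desired state. Data, secrets, and anything else that never lived in Git still need their own recovery path.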


Production Checklist Before You Call It Done

Before handing this to your SRE team, run through these:

Security:

  • ArgoCD itself managed by ArgoCD (app of apps)
  • SSO configured (Dex with your OIDC provider, not local users)
  • AppProject per team with scoped repo and namespace access
  • RBAC roles: read-only for devs, sync for team leads, admin only for platform
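
As a sketch of the last two items, a per-team AppProject with scoped access and project-level roles could look like this. The project name, role names, and SSO group names are assumptions for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-payments
  namespace: argocd
spec:
  description: Payments team workloads
  # Only this repo may be used as a source
  sourceRepos:
    - https://github.com/your-org/k8s-manifests
  # Only this namespace on the local cluster may be a destination
  destinations:
    - server: https://kubernetes.default.svc
      namespace: payments
  # App teams get no cluster-scoped resources
  clusterResourceWhitelist: []
  roles:
    - name: developer
      description: Read-only access to the team's apps
      policies:
        - p, proj:team-payments:developer, applications, get, team-payments/*, allow
      groups:
        - your-org:payments-devs      # hypothetical SSO group
    - name: lead
      description: May view and sync the team's apps
      policies:
        - p, proj:team-payments:lead, applications, get, team-payments/*, allow
        - p, proj:team-payments:lead, applications, sync, team-payments/*, allow
      groups:
        - your-org:payments-leads     # hypothetical SSO group
```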

Reliability:

  • ArgoCD HA mode if you’re running multiple clusters (repo-server and application-controller scaled)
  • Redis in HA mode rather than with persistence: the cache is disposable and gets rebuilt on loss, so availability matters more than durability
  • Separate AppProject for ArgoCD’s own components

Observability:

  • Prometheus metrics scraped from argocd-metrics:8082 and argocd-server-metrics:8083
  • Grafana dashboard (official one is solid, fork and customize)
  • Alerting rules for OutOfSync, Degraded, and sync failures as shown above
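
If you run the Prometheus Operator, scraping can be declared with ServiceMonitors. A minimal sketch, assuming the operator selects monitors by a release label (the label value here is a placeholder for whatever your Prometheus instance matches on):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-metrics
  namespace: argocd
  labels:
    release: prometheus-operator   # placeholder, match your Prometheus selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-metrics
  endpoints:
    - port: metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-server-metrics
  namespace: argocd
  labels:
    release: prometheus-operator   # placeholder, match your Prometheus selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-server-metrics
  endpoints:
    - port: metrics
```

The application controller's metrics service is the one that carries argocd_app_info, so that monitor is the one your drift alerts depend on.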

Operations:

  • Sync windows documented in your runbooks
  • Runbook entry for "how to force-sync during a change freeze"
  • ignoreDifferences covering all controller-managed fields

Where People Go Wrong

The biggest mistake I see teams make: enabling automated.selfHeal: true and prune: true in production without sync windows, then being surprised when a bad commit auto-deploys and auto-prunes resources.

The second biggest: using ArgoCD for config management without agreeing on a branching strategy. If your team pushes to main without review, ArgoCD will deploy garbage just as efficiently as it deploys good code.

ArgoCD is not a safety net. It’s a force multiplier. It makes your deployment process faster, more auditable, and more reliable — but only if the process feeding it is disciplined.

The drift detection value comes from the alerting discipline around it. If your team gets an OutOfSync alert and the response is "yeah, someone scaled it, ignore it" — you’ve already lost the benefit. Fix that by making the ignoreDifferences config accurate enough that every drift alert is actually actionable.


Sync windows + Kustomize overlays + proper drift alerting is the combination that takes ArgoCD from "cool tool" to "actually running our production safely." Start with drift alerting (lowest risk, immediate value), add Kustomize overlays when you have more than two environments, and add sync windows only after your team has used ArgoCD for a month and understands the operational implications of blocking a sync.
