Your cluster is lying to you right now. Not maliciously — it just diverged from what your Git repo says it should be. Someone patched a ConfigMap directly with kubectl edit. A developer scaled a Deployment by hand and forgot to push the change. A Helm hook ran and nobody noticed the resource it left behind.
This is config drift, and it’s the silent killer of on-call peace. You build a system, document it in Git, and slowly — kubectl command by kubectl command — the cluster becomes an undocumented snowflake that only half the team understands.
ArgoCD solves this. Not partially. Fundamentally. But only if you actually understand what it’s doing and configure it properly for SRE work. This article is for teams that have moved past "ArgoCD is installed and apps deploy" and need to know how to make it production-grade: drift detection that alerts before it’s a problem, sync windows that protect your change freeze, and Kustomize overlays that don’t turn into a maintenance nightmare.
GitHub: https://github.com/argoproj/argo-cd
Why Drift Happens and Why You Should Care
When you’re operating Kubernetes at any meaningful scale, the gap between "what Git says" and "what’s running" is almost always non-zero. The usual culprits:
- Emergency hotfixes applied directly to prod and never backported
- Operators or controllers mutating resources (Karpenter, cert-manager, Istio sidecars)
- Failed rollbacks that got stuck halfway
- Someone using kubectl replace thinking it wouldn’t matter
From an SRE perspective, drift is technical debt that converts into incidents. When the cluster state is unknown, your runbooks are worthless, your disaster recovery assumptions are wrong, and your MTTR goes up every time something breaks.
ArgoCD makes the desired state explicit, observable, and enforceable. It doesn’t prevent all drift — some of it is intentional and fine — but it makes drift visible and actionable.
ArgoCD Architecture in 30 Seconds
ArgoCD runs inside your cluster. Its core loop is simple:
- Pull manifests from a Git repo (or Helm chart, or Kustomize directory)
- Compare them against live cluster state
- Report differences (drift) or automatically reconcile them
The key components you’ll interact with as an SRE:
- Application — the CR that ties a Git source to a cluster destination
- AppProject — RBAC and policy boundary (which repos, which clusters, which namespaces)
- ApplicationSet — generates multiple Applications from a template (invaluable for multi-cluster)
Don’t run ArgoCD without RBAC configured. The default admin account with no project boundaries will cause problems on any team larger than two people.
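As a rough sketch of what a minimal RBAC setup could look like, the argocd-rbac-cm ConfigMap below gives everyone read-only access by default and grants sync rights to one team. The role name role:payments-deployer and the group names your-org:payments-leads and your-org:platform are hypothetical; substitute whatever groups your SSO provider actually emits.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  # Users without an explicit role fall back to the built-in read-only role
  policy.default: role:readonly
  policy.csv: |
    # p, <role>, <resource>, <action>, <project>/<application>, <effect>
    p, role:payments-deployer, applications, get, production/payments-*, allow
    p, role:payments-deployer, applications, sync, production/payments-*, allow
    # g, <SSO group or user>, <role>  (group names are hypothetical)
    g, your-org:payments-leads, role:payments-deployer
    g, your-org:platform, role:admin
```

The project part of each policy line is what ties RBAC back to the AppProject boundary: a role scoped to production/payments-* cannot touch applications in other projects.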
Installing ArgoCD the Right Way
Skip the kubectl apply -f install.yaml from the docs if you’re going to production. You want to manage ArgoCD itself with ArgoCD (app of apps pattern), and you want the install to be reproducible.
Kustomize-based install:
argocd/
├── kustomization.yaml
├── namespace.yaml
└── patches/
    ├── argocd-cmd-params-cm.yaml
    └── argocd-server-deployment.yaml
kustomization.yaml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argocd
resources:
- namespace.yaml
# Pin to a specific version — never use latest in prod
- https://raw.githubusercontent.com/argoproj/argo-cd/v2.13.3/manifests/install.yaml
patches:
- path: patches/argocd-cmd-params-cm.yaml
- path: patches/argocd-server-deployment.yaml
patches/argocd-cmd-params-cm.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Disable insecure mode; always terminate TLS properly
  server.insecure: "false"
  # Resource tracking label key, not a polling setting. ArgoCD reads this key
  # and the repo polling interval (timeout.reconciliation) from argocd-cm,
  # so it likely belongs in a patch for that ConfigMap instead
  application.instanceLabelKey: "argocd.argoproj.io/app-name"
  # Enable status badge and metrics
  server.enable.prometheus.metrics: "true"
Gotcha: Many tutorials run the server with --insecure to skip TLS termination at ArgoCD’s own server. Don’t do this in prod. Put a proper ingress with cert-manager in front of it.
Drift Detection: Understanding What ArgoCD Actually Compares
ArgoCD uses a three-way diff: desired state (Git), live state (cluster), and last applied state. This is the same approach as kubectl apply, but ArgoCD makes the result persistent and observable.
The sync status you’ll see:
- Synced: live state matches desired state
- OutOfSync: there’s a diff; ArgoCD won’t necessarily fix it automatically
- Unknown: ArgoCD can’t reach the cluster or repo
The health status is separate:
- Healthy: workloads are running and ready
- Degraded: something is broken at the resource level
- Progressing: rollout in flight
- Suspended: the resource itself is paused or suspended (for example a paused Deployment), not the sync
These two axes tell different stories. An app can be Synced but Degraded (cluster matches Git, but the Deployment is crashing). It can also be OutOfSync but Healthy (someone manually scaled a Deployment up — cluster is fine, but it doesn’t match Git).
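Both axes are recorded on the Application resource itself. As an illustration only, the status of the "scaled by hand" case would contain something like the excerpt below. The values describe that scenario and are not output from a real cluster, and other status fields are omitted.

```yaml
# Excerpt of: kubectl get application <app-name> -n argocd -o yaml
# (illustrative values for the manually scaled Deployment scenario)
status:
  sync:
    status: OutOfSync   # live state differs from what Git renders
  health:
    status: Healthy     # the workload itself is running fine
```

Because the two fields are independent, dashboards and alerts should treat them as separate signals instead of collapsing them into one "app is OK" flag.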
Configuring Drift Detection Properly
By default, ArgoCD rescans every 3 minutes. For SRE use you want to be more deliberate:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/your-org/k8s-manifests
    targetRevision: main
    path: apps/payments-api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    # Do NOT enable automated sync without thinking about it.
    # The presence of this block already auto-syncs new commits.
    automated:
      prune: false     # Never delete resources automatically in prod
      selfHeal: false  # Drift alerts, but humans decide when to fix
    syncOptions:
      - CreateNamespace=false  # Namespaces should be pre-created explicitly
      - PrunePropagationPolicy=foreground
      - RespectIgnoreDifferences=true
  ignoreDifferences:
    # Ignore fields mutated by controllers you trust
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas  # HPA manages this, not Git
    - group: ""
      kind: Secret
      name: payments-api-tls
      jsonPointers:
        - /data  # cert-manager manages cert data
The ignoreDifferences block is critical. Without it, HPA will constantly fight ArgoCD over replica counts, and cert-manager will always show your TLS secrets as OutOfSync. You’ll get alert fatigue and stop trusting your drift notifications.
Gotcha: selfHeal: true in production without sync windows and proper testing is dangerous. If Git has a bad commit, ArgoCD will actively push it to prod. Start with selfHeal: false and add it only after your team has confidence in the pipeline.
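The 3-minute rescan mentioned earlier is controlled by the timeout.reconciliation key in the argocd-cm ConfigMap. Here is a minimal sketch of changing it, assuming you patch argocd-cm through the same Kustomize-based install. The 300s value is only an example, and the ArgoCD components may need a restart before a new value takes effect.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # How often ArgoCD re-compares Git against the cluster (default: 180s).
  # Longer intervals reduce load on Git but delay drift detection.
  timeout.reconciliation: 300s
```

Whatever value you pick, keep your alerting delays longer than it, so an alert never fires before ArgoCD has had at least one chance to re-evaluate the app.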
Making Drift Visible
ArgoCD exposes Prometheus metrics. The ones that matter for SRE alerting:
# Prometheus alerting rules for ArgoCD
groups:
  - name: argocd-drift
    rules:
      - alert: ArgoCDAppOutOfSync
        expr: |
          argocd_app_info{sync_status="OutOfSync"} == 1
        for: 15m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "App {{ $labels.name }} has been OutOfSync for 15+ minutes"
          description: |
            Application {{ $labels.name }} in project {{ $labels.project }}
            is OutOfSync. Check ArgoCD UI or run:
            argocd app diff {{ $labels.name }}
      - alert: ArgoCDAppDegraded
        expr: |
          argocd_app_info{health_status="Degraded"} == 1
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "App {{ $labels.name }} is Degraded"
      # sync_status="Unknown" means ArgoCD could not compute the comparison
      # (repo or cluster unreachable); it is not the same as a failed sync operation
      - alert: ArgoCDAppSyncStatusUnknown
        expr: |
          argocd_app_info{sync_status="Unknown"} == 1
        for: 10m
        labels:
          severity: warning
Waiting 15 minutes before alerting on OutOfSync gives ArgoCD time to finish a sync cycle and avoids noisy alerts from short-lived transient states. Adjust based on your reconciliation interval.
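The rules above watch application state. To catch sync operations that actually failed, one option is to alert on the argocd_app_sync_total counter, which carries a phase label. The rule below is a sketch, not a tuned production rule; the group name, alert name, and 10-minute window are arbitrary choices.

```yaml
groups:
  - name: argocd-sync-failures
    rules:
      - alert: ArgoCDSyncOperationFailed
        # Fires when any sync for an app ended in Error or Failed
        # during the last 10 minutes
        expr: |
          sum by (name, project) (
            increase(argocd_app_sync_total{phase=~"Error|Failed"}[10m])
          ) > 0
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Sync operation failed for {{ $labels.name }}"
```

There is deliberately no `for:` clause here: the increase() window already bounds how long a single failure keeps the alert active.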
Sync Windows: Protecting Your Change Freeze
Sync windows are AppProject-level gates that control when syncs can happen. This is exactly what you need for:
- Change freezes before major releases
- Protecting prod during business-critical hours
- Enforcing a release schedule
Sync windows live on the AppProject, not the Application. This is intentional — it’s a policy concern, not a per-app configuration.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: Production workloads
  sourceRepos:
    - 'https://github.com/your-org/k8s-manifests'
  destinations:
    - namespace: '*'
      server: https://kubernetes.default.svc
  clusterResourceWhitelist:
    - group: '*'
      kind: '*'
  syncWindows:
    # Allow deployments only during business hours on weekdays
    - kind: allow
      schedule: "0 8 * * 1-5"  # 08:00 Mon-Fri
      duration: 10h            # window closes at 18:00
      applications:
        - '*'
      manualSync: true         # Humans can still sync manually while this window is closed
      timeZone: "Europe/Tallinn"
    # Always-open allow window so critical apps can sync at any hour
    - kind: allow
      schedule: "0 0 * * *"
      duration: 24h
      applications:
        - "payments-api"
        - "auth-service"
      namespaces:
        - payments
        - auth
      manualSync: true
    # Block everything during release freeze
    - kind: deny
      schedule: "0 0 * * 5"    # 00:00 on Friday
      duration: 72h            # Blocks Friday through Sunday
      applications:
        - '*'
      manualSync: false        # No one bypasses this, not even manually
      timeZone: "Europe/Tallinn"
The deny window overrides allow windows. This is the correct priority order for a change freeze — you don’t want someone accidentally scheduling a deploy that slips through because an allow window overlaps with the deny window.
Gotcha: manualSync: false on a deny window means ArgoCD also blocks manual syncs triggered from the UI or CLI. If you have an incident during the freeze and need to deploy a fix, you’ll need to either delete the window temporarily or re-enable manual sync on it with argocd proj windows enable-manual-sync production <ID>. Plan your incident response procedure around this before you implement it.
Gotcha: Sync windows do not affect selfHeal. If self-heal is enabled and a window is active, ArgoCD will still attempt to reconcile drift. This is a well-known footgun. If you’re using self-heal, your automated behavior is not fully gated by sync windows.
Checking Window Status from CLI
# See current window state for an app
argocd app get payments-api --show-operation
# List the project's windows with their IDs and whether each is active
argocd proj windows list production
# Temporarily permit manual syncs on the window that is blocking you, then sync
# (<ID> is the window ID from the list output)
argocd proj windows enable-manual-sync production <ID>
argocd app sync payments-api
Kustomize: The Right Way to Manage Multi-Env Configs
Kustomize is built into ArgoCD — no extra tooling needed. But how you structure your overlays determines whether Kustomize stays maintainable or becomes a layered mess of patches on patches.
The structure that works at scale:
k8s-manifests/
├── apps/
│   └── payments-api/
│       ├── base/
│       │   ├── kustomization.yaml
│       │   ├── deployment.yaml
│       │   ├── service.yaml
│       │   └── configmap.yaml
│       └── overlays/
│           ├── staging/
│           │   ├── kustomization.yaml
│           │   └── patches/
│           │       └── deployment-resources.yaml
│           └── prod/
│               ├── kustomization.yaml
│               └── patches/
│                   ├── deployment-resources.yaml
│                   └── deployment-replicas.yaml
apps/payments-api/base/kustomization.yaml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - configmap.yaml
commonLabels:
  app.kubernetes.io/name: payments-api
  app.kubernetes.io/managed-by: argocd
apps/payments-api/base/deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  # Base replica count; overlays will patch this for prod
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: payments-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: payments-api
    spec:
      containers:
        - name: payments-api
          image: your-registry/payments-api:latest  # image tag injected by CI
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          env:
            - name: ENV
              valueFrom:
                configMapKeyRef:
                  name: payments-api-config
                  key: ENV
apps/payments-api/overlays/prod/kustomization.yaml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: payments
resources:
  - ../../base
# Replace image tag with specific prod version
images:
  - name: your-registry/payments-api
    newTag: "v1.42.0"  # CI updates this field via kustomize edit set image
patches:
  - path: patches/deployment-resources.yaml
  - path: patches/deployment-replicas.yaml
# Prod-specific config values
configMapGenerator:
  - name: payments-api-config
    behavior: merge
    literals:
      - ENV=production
      - LOG_LEVEL=info
      - DB_HOST=payments-db.prod.svc.cluster.local
apps/payments-api/overlays/prod/patches/deployment-resources.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  template:
    spec:
      containers:
        - name: payments-api
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 2000m
              memory: 1Gi
apps/payments-api/overlays/prod/patches/deployment-replicas.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
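The staging overlay from the directory tree is not shown in full, but it would plausibly mirror the prod one with fewer patches. The following is a sketch under assumptions: the payments-staging namespace, the image tag, and the config values are all hypothetical placeholders.

```yaml
# apps/payments-api/overlays/staging/kustomization.yaml (hypothetical contents)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: payments-staging        # assumed namespace name
resources:
  - ../../base
images:
  - name: your-registry/payments-api
    newTag: "v1.43.0-rc.1"         # placeholder tag, updated by CI
patches:
  # Staging keeps the base replica count, so only resources are patched
  - path: patches/deployment-resources.yaml
configMapGenerator:
  - name: payments-api-config
    behavior: merge
    literals:
      - ENV=staging
      - LOG_LEVEL=debug
```

Keeping each overlay to "base plus a handful of small patches" is what stops the layout from degrading into patches on patches. If an overlay starts needing many patches, that is usually a sign the base should change instead.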
Updating Image Tags in CI
A reliable way to update image tags without hand-editing manifests:
#!/bin/bash
# ci-update-image.sh: runs in your CI pipeline after image push
set -euo pipefail
APP=$1      # e.g., payments-api
ENV=$2      # e.g., prod
NEW_TAG=$3  # e.g., v1.43.0
OVERLAY_PATH="apps/${APP}/overlays/${ENV}"
IMAGE_NAME="your-registry/${APP}"
cd k8s-manifests
git config user.email "[email protected]"
git config user.name "CI Bot"
# kustomize edit operates on the kustomization.yaml in the current directory,
# so run it from inside the overlay; the subshell keeps our working directory
(
  cd "${OVERLAY_PATH}"
  kustomize edit set image "${IMAGE_NAME}=${IMAGE_NAME}:${NEW_TAG}"
)
git add "${OVERLAY_PATH}/kustomization.yaml"
git commit -m "chore(${APP}): bump ${ENV} image to ${NEW_TAG} [ci skip]"
git push origin main
ArgoCD picks up the commit and syncs based on your sync policy and windows. No webhooks needed — ArgoCD polls. If you need faster propagation, configure a webhook from your Git provider to ArgoCD’s /api/webhook endpoint.
Gotcha: The [ci skip] in the commit message prevents CI from triggering again on its own commit if your pipeline triggers on pushes to main. Without this, you get a loop.
Gotcha: Don’t use configMapGenerator with behavior: replace in overlays unless you know what you’re doing. Replace drops all base keys and only keeps overlay keys. merge is almost always what you want. I’ve seen this cause production outages when someone thought they were adding a key but actually wiped the config.
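To make the merge versus replace difference concrete, here is a small illustration. The base keys listed in the comments are hypothetical, since the contents of base/configmap.yaml are not shown in this article.

```yaml
# overlays/prod/kustomization.yaml (fragment)
# Hypothetical base ConfigMap data, for illustration only:
#   ENV: base
#   LOG_LEVEL: debug
#   TIMEOUT_SECONDS: "30"
configMapGenerator:
  - name: payments-api-config
    behavior: merge      # overlay keys win, untouched base keys survive
    literals:
      - ENV=production
# Rendered data with behavior: merge
#   ENV=production, LOG_LEVEL=debug, TIMEOUT_SECONDS=30
# Rendered data with behavior: replace
#   ENV=production only; LOG_LEVEL and TIMEOUT_SECONDS are gone
```

Running kustomize build on the overlay and diffing the rendered ConfigMap before merging is a cheap way to catch this class of mistake in review.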
The App of Apps Pattern for SRE Teams
Instead of creating Applications one by one in the ArgoCD UI, define them in Git too. One root Application that manages all other Applications:
# Root app: the only one you create manually
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-apps
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: platform
  source:
    repoURL: https://github.com/your-org/k8s-manifests
    targetRevision: main
    path: argocd/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
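Every file in argocd/apps/ is an ArgoCD Application manifest. Adding a new app to the cluster is a Git commit. Removing one is also a Git commit: with prune enabled, the root app deletes the child Application, and if that child carries the resources finalizer, ArgoCD cleans up its resources as well.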
This pattern also makes your disaster recovery story much cleaner. Rebuild the cluster, install ArgoCD, apply the root app, wait 5 minutes — cluster is back to the desired state.
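If hand-writing one Application file per app becomes tedious, the ApplicationSet resource mentioned in the architecture overview can generate them from the repo layout instead. The sketch below uses the Git directory generator against the apps/*/overlays/prod structure used in this article. The ApplicationSet name is made up, and using the app's directory name as the destination namespace is an assumption that does not hold for every app here (payments-api deploys to payments, for example).

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: prod-apps            # hypothetical name
  namespace: argocd
spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
    # One Application per directory matching apps/<app>/overlays/prod
    - git:
        repoURL: https://github.com/your-org/k8s-manifests
        revision: main
        directories:
          - path: apps/*/overlays/prod
  template:
    metadata:
      # segments[1] of apps/<app>/overlays/prod is the app directory name
      name: '{{ index .path.segments 1 }}'
    spec:
      project: production
      source:
        repoURL: https://github.com/your-org/k8s-manifests
        targetRevision: main
        path: '{{ .path.path }}'
      destination:
        server: https://kubernetes.default.svc
        # Assumption: namespace equals the app directory name
        namespace: '{{ index .path.segments 1 }}'
```

No syncPolicy is set in the template, so generated Applications sync manually by default. That matches the "humans decide" stance taken for production earlier in the article.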
Production Checklist Before You Call It Done
Before handing this to your SRE team, run through these:
Security:
- ArgoCD itself managed by ArgoCD (app of apps)
- SSO configured (Dex with your OIDC provider, not local users)
- AppProject per team with scoped repo and namespace access
- RBAC roles: read-only for devs, sync for team leads, admin only for platform
Reliability:
- ArgoCD HA mode if you’re running multiple clusters (repo-server and application-controller scaled)
- Redis with persistence for caching repo state
- Separate AppProject for ArgoCD’s own components
Observability:
- Prometheus metrics scraped from argocd-metrics:8082 and argocd-server-metrics:8083
- Grafana dashboard (official one is solid, fork and customize)
- Alerting rules for OutOfSync, Degraded, and sync failures as shown above
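If you run the Prometheus Operator, scraping those two endpoints can be declared with ServiceMonitor resources along these lines. This is a sketch: it assumes the operator's CRDs are installed, and the release label is a placeholder that must match whatever label selector your Prometheus instance uses to discover ServiceMonitors.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-metrics
  namespace: argocd
  labels:
    release: prometheus-operator   # placeholder; match your Prometheus selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-metrics          # application controller metrics
  endpoints:
    - port: metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-server-metrics
  namespace: argocd
  labels:
    release: prometheus-operator   # placeholder; match your Prometheus selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-server-metrics   # API server metrics
  endpoints:
    - port: metrics
```

The argocd_app_info and argocd_app_sync_total series used for drift alerting come from the application controller, so the first ServiceMonitor is the one the alerts depend on.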
Operations:
- Sync windows documented in your runbooks
- Runbook entry for "how to force-sync during a change freeze"
- ignoreDifferences covering all controller-managed fields
Where People Go Wrong
The biggest mistake I see teams make: enabling automated.selfHeal: true and prune: true in production without sync windows, then being surprised when a bad commit auto-deploys and auto-prunes resources.
The second biggest: using ArgoCD for config management without agreeing on a branching strategy. If your team pushes to main without review, ArgoCD will deploy garbage just as efficiently as it deploys good code.
ArgoCD is not a safety net. It’s a force multiplier. It makes your deployment process faster, more auditable, and more reliable — but only if the process feeding it is disciplined.
The drift detection value comes from the alerting discipline around it. If your team gets an OutOfSync alert and the response is "yeah, someone scaled it, ignore it" — you’ve already lost the benefit. Fix that by making the ignoreDifferences config accurate enough that every drift alert is actually actionable.
Sync windows + Kustomize overlays + proper drift alerting is the combination that takes ArgoCD from "cool tool" to "actually running our production safely." Start with drift alerting (lowest risk, immediate value), add Kustomize overlays when you have more than two environments, and introduce sync windows only after your team has used ArgoCD for a month and understands the operational implications of blocking a sync.