Putting k3s in Production: The Setup That Doesn’t Burn You at 3 AM

You can spin up k3s in 90 seconds. The fun part is keeping it alive for a year. This guide walks through the production hardening checklist that most tutorials skip — the choices that decide whether your cluster survives the first power loss, the first cert expiry, and the first 10x traffic spike.

We’re assuming a small-to-mid setup: 3 control-plane nodes, a handful of workers, somewhere between 50 and 500 pods. If you’re running fewer than 50 pods on one box, honestly — docker-compose is still fine and you’ll thank yourself.

Why k3s and not k8s

Both run the same containers. Differences that matter in prod:

Feature      | k3s                                              | k8s (kubeadm)
Binary size  | ~70 MB single binary                             | ~1 GB across multiple components
Default DB   | sqlite (single-node) / embedded etcd (HA)        | etcd (stacked by default, or external)
Built-in     | traefik, servicelb, local-path, helm-controller  | nothing (bring your own)
RAM floor    | ~512 MB per node                                 | ~2 GB per node
Upgrades     | one binary swap                                  | coordinated component dance
ARM support  | first-class                                      | works but rougher

k3s wins for edge, hobbyist-prod, and "small team running ~10 apps." k8s wins when you’re a 20-engineer platform team that already owns the tooling.

Pre-flight — what you fix before curl | sh

Kernel and sysctl

k3s wants a few sysctls set or it complains:

cat <<EOF | sudo tee /etc/sysctl.d/99-k3s.conf
net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-ip6tables=1
net.ipv4.ip_forward=1
fs.inotify.max_user_instances=8192
fs.inotify.max_user_watches=524288
vm.swappiness=0
EOF
sudo sysctl --system

The inotify ones bite you around 30+ pods — symptoms are mysterious "no space left on device" errors that have nothing to do with disk.
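To confirm the settings actually took effect after a reboot, a small check like the one below can help. It is a minimal sketch, not part of k3s: the function name check_sysctl and the PROC_SYS override are made up for illustration, and it only handles "at least this value" settings (so not vm.swappiness=0).

```shell
# Sketch: compare live kernel values with the minimums from 99-k3s.conf.
# PROC_SYS is a hypothetical override so the check can run against a fake tree.
PROC_SYS="${PROC_SYS:-/proc/sys}"

check_sysctl() {
  key=$1
  want=$2
  # Map the dotted key to its file, e.g. net.ipv4.ip_forward -> net/ipv4/ip_forward
  file="$PROC_SYS/$(printf '%s' "$key" | tr . /)"
  if [ ! -r "$file" ]; then
    echo "MISSING $key"
    return 1
  fi
  have=$(cat "$file")
  if [ "$have" -ge "$want" ]; then
    echo "OK $key=$have"
  else
    echo "LOW $key=$have (want >= $want)"
    return 1
  fi
}

# Example calls:
#   check_sysctl net.ipv4.ip_forward 1
#   check_sysctl net.bridge.bridge-nf-call-iptables 1
#   check_sysctl fs.inotify.max_user_watches 524288
```

A MISSING result for the net.bridge.* keys usually means the br_netfilter kernel module is not loaded, since those entries only exist once it is.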

Disable swap (Kubernetes hates swap)

sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

k3s 1.22+ technically supports swap with feature gates, but for production just turn it off.

Time sync — non-negotiable

sudo apt install -y chrony
sudo systemctl enable --now chrony
chronyc tracking

Clock skew between nodes breaks etcd, breaks certificate validation, breaks audit logs. If you skip this, you’ll spend a Friday night debugging "random" pod evictions.

Installation: stop pasting curl | sh

The official one-liner ships defaults that are great for a laptop demo and wrong for production. Spell out what you actually want:

# Control-plane node 1
curl -sfL https://get.k3s.io | sudo INSTALL_K3S_EXEC="
  server
  --cluster-init
  --disable=traefik
  --disable=servicelb
  --write-kubeconfig-mode=0640
  --kube-apiserver-arg=audit-log-path=/var/log/k3s-audit.log
  --kube-apiserver-arg=audit-log-maxage=30
  --secrets-encryption
  --tls-san=k3s.internal.example.com
" sh -

What each flag does:

  • --cluster-init — embedded etcd, HA-ready. Don’t skip even on day one; migrating from sqlite to etcd later is painful.
  • --disable=traefik — k3s ships traefik v2. It works, but you’ll want to pin a specific ingress controller via your own GitOps. Disable the bundled one.
  • --disable=servicelb — this is klipper-lb. Replace with MetalLB for real LoadBalancer support.
  • --secrets-encryption — encrypts Secrets at rest in etcd. The default is plaintext.
  • --tls-san — extra SAN on the apiserver cert so you can reach the cluster by a stable hostname later, not just node IPs.
  • audit-log-* — boring until the day legal asks "who deleted that namespace?"

Grab the join token:

sudo cat /var/lib/rancher/k3s/server/node-token

Then join nodes 2 and 3 to form the HA control plane:

curl -sfL https://get.k3s.io | sudo INSTALL_K3S_EXEC="
  server
  --server=https://NODE1_IP:6443
  --disable=traefik
  --disable=servicelb
  --secrets-encryption
" K3S_TOKEN=<token> sh -

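Three control-plane nodes give you embedded etcd with quorum. Lose one and the cluster keeps working. Lose two and etcd loses quorum: the API server stops serving requests until quorum is restored, though pods that are already running keep running.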
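The quorum arithmetic behind that is simple enough to sketch. The helper names below are invented for illustration; the formula (a majority of members) is the standard Raft rule etcd follows.

```shell
# Raft quorum arithmetic for an n-member etcd cluster.
quorum() { echo $(( $1 / 2 + 1 )); }
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  printf '%s members: quorum %s, survives %s failure(s)\n' \
    "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

Four members survive exactly as many failures as three, which is why control planes are sized in odd numbers.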

Storage: local-path is a trap in prod

k3s installs local-path-provisioner by default. It’s great for throwaway data such as caches and scratch databases, and catastrophic for anything you want to survive a node failure: a local-path volume lives on one node’s disk, so if that node dies, its pods sit in Pending and the data goes with the disk.

For real persistence, install Longhorn:

helm repo add longhorn https://charts.longhorn.io
helm install longhorn longhorn/longhorn \
  --namespace longhorn-system --create-namespace \
  --set defaultSettings.defaultReplicaCount=2 \
  --set defaultSettings.backupTarget=s3://your-backup-bucket@us-east-1/

Make it the default and demote local-path:

kubectl annotate sc local-path storageclass.kubernetes.io/is-default-class- --overwrite
kubectl annotate sc longhorn storageclass.kubernetes.io/is-default-class=true --overwrite

Longhorn replicates each PV across N nodes. With defaultReplicaCount=2, you survive one node loss without data loss. The backupTarget setting points Longhorn backups at S3 (it also needs a credentials Secret, set via backupTargetCredentialSecret); configure it now, not after your first incident.

If you don’t want a separate storage layer, the other reasonable choices are OpenEBS Mayastor (faster, more complex) or — if you have separate storage hardware — NFS via the csi-driver-nfs. Avoid hostPath like the plague.

Ingress: traefik or nginx?

Both are fine. The decision matrix:

  • traefik v2/v3 — Docker-labels-style config, great auto-discovery, weaker for complex routing, smaller community.
  • ingress-nginx — battle-tested, ubiquitous Stack Overflow answers, heavier RAM footprint, ConfigMap-style annotations.

Pick one and commit. Mixing them creates routing confusion. For most teams, ingress-nginx is the safer choice purely because every Helm chart on GitHub assumes you have it.

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace \
  --set controller.kind=DaemonSet \
  --set controller.service.type=LoadBalancer \
  --set controller.metrics.enabled=true

DaemonSet means one pod per node — predictable IPs for LoadBalancer/MetalLB. Metrics on — you’ll need them by week two.

TLS: cert-manager + Let’s Encrypt, day one

Don’t ship to prod without automated certs. The cost is one manifest apply:

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml

Then a ClusterIssuer for Let’s Encrypt:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com   # placeholder; use a mailbox you actually read
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx

Annotate any Ingress with cert-manager.io/cluster-issuer: letsencrypt-prod and certs renew themselves. Production teams that skip this end up with 03:00 PagerDuty alerts every 90 days when a manually managed cert expires.

etcd backups: snapshot to S3

k3s embedded etcd can snapshot itself:

sudo k3s etcd-snapshot save \
  --s3 \
  --s3-bucket=my-k3s-backups \
  --s3-region=us-east-1 \
  --s3-access-key=AKIA... \
  --s3-secret-key=...

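Schedule it with the server’s built-in snapshot options. The simplest place to set them is the k3s config file on each server node; S3 credentials go in the same file (etcd-s3-access-key, etcd-s3-secret-key):

# /etc/rancher/k3s/config.yaml (restart k3s after editing)
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 20
etcd-s3: true
etcd-s3-bucket: my-k3s-backups
etcd-s3-region: us-east-1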

Six-hour cadence, 20 snapshots retained — five days of point-in-time recovery. Restore is k3s server --cluster-reset --cluster-reset-restore-path=....
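A schedule only helps if it is actually producing files, so it is worth alerting when the newest snapshot is older than one interval. The function below is a rough sketch with a made-up name; it assumes local copies land in the default k3s snapshot directory and that your find supports -mmin (GNU and BSD find do).

```shell
# Sketch: succeed only if DIR holds a file modified within the last MAX_HOURS.
has_fresh_snapshot() {
  dir=$1
  max_hours=$2
  newest=$(find "$dir" -type f -mmin "-$(( max_hours * 60 ))" 2>/dev/null | head -n 1)
  [ -n "$newest" ]
}

# Example (assumed default local snapshot directory; adjust if you changed it):
#   has_fresh_snapshot /var/lib/rancher/k3s/server/db/snapshots 7 \
#     || echo "etcd snapshots are stale"
```

Seven hours gives a six-hour schedule one hour of slack before the check fails.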

If etcd dies and your last snapshot is from yesterday, you can still recover PV data from Longhorn backups, but you lose every change to cluster state (Deployments, Services, Secrets) made since that snapshot. The pairing of etcd + Longhorn snapshots is what makes the cluster restorable from total loss.

Network policies — default-deny, not default-allow

Out of the box, every pod can talk to every other pod. In production, switch to default-deny in each namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: web
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Then explicitly allow what each app needs. k3s ships with flannel + the kube-router NetworkPolicy controller by default, so policies actually enforce. Verify with:

kubectl -n web run nginx --image=nginx
kubectl -n web exec -it nginx -- curl -m 3 some-other-pod   # should time out

If it doesn’t time out, your NetworkPolicy isn’t being enforced. Most common cause: someone passed --disable-network-policy in INSTALL_K3S_EXEC.
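Once default-deny is in place, the first thing most apps need back is DNS. As one example of "explicitly allow what each app needs", the sketch below wraps a DNS egress policy in a shell function. The function and policy names are invented for illustration, and it assumes CoreDNS runs in kube-system (the k3s default) and that namespaces carry the automatic kubernetes.io/metadata.name label.

```shell
# Sketch: re-open DNS egress for every pod in one namespace.
# "allow-dns-egress" is a made-up policy name.
allow_dns_egress() {
  ns=$1
  kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}

# Example: allow_dns_egress web
```

Everything else (app-to-database, ingress-controller-to-app) follows the same pattern with narrower selectors.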

RBAC: stop using cluster-admin

The default kubectl config from /etc/rancher/k3s/k3s.yaml is cluster-admin. Don’t share that file. For each human/CI/service, create a dedicated ServiceAccount with the minimum verbs and resources they need:

apiVersion: v1
kind: ServiceAccount
metadata: { name: deploy-bot, namespace: web }
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata: { name: deploy-bot, namespace: web }
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata: { name: deploy-bot, namespace: web }
roleRef:
  kind: Role
  name: deploy-bot
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: deploy-bot
    namespace: web

If your CI just needs to roll out a Deployment, that’s all the power it should have. Audit log entries from a compromised cluster-admin token look identical to legitimate ones — narrow permissions are how you survive a leaked secret.
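It is worth verifying that the Role grants what you think it does, and nothing more. The helper below is a small sketch (the name sa_can is made up) around kubectl auth can-i with impersonation; it assumes you run it with a kubeconfig that is allowed to impersonate service accounts, which the admin kubeconfig is.

```shell
# Sketch: ask the API server whether a ServiceAccount may perform an action.
# Prints "yes" or "no" and exits accordingly.
sa_can() {
  ns=$1
  sa=$2
  verb=$3
  resource=$4
  kubectl auth can-i "$verb" "$resource" \
    --namespace "$ns" \
    --as "system:serviceaccount:${ns}:${sa}"
}

# Examples against the Role above:
#   sa_can web deploy-bot patch deployments    # expected: yes
#   sa_can web deploy-bot delete deployments   # expected: no
#   sa_can web deploy-bot get secrets          # expected: no
```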

Monitoring: prometheus-stack, day one

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prom prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.adminPassword="$(openssl rand -hex 16)" \
  --set prometheus.prometheusSpec.retention=15d

You get Prometheus + Alertmanager + Grafana + node-exporter + the default k8s dashboards. The dashboards alone catch 80% of production issues before users notice — pod restart rates, CPU throttling, OOMKills, etcd disk latency.

Pin alerts on:

  • KubePodCrashLooping — pod restarting too often
  • KubePersistentVolumeFillingUp — disk pressure
  • KubeAPIErrorBudgetBurn — control-plane unhealthy
  • etcdHighNumberOfFailedProposals — quorum issues

Anything else is nice-to-have. Those four are what wakes you when something is actually broken.

Upgrades: system-upgrade-controller

Don’t apt upgrade your k3s nodes. Use system-upgrade-controller for orchestrated rolling upgrades:

kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml

Then declarative upgrade plans:

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  version: v1.30.3+k3s1
  nodeSelector:
    matchExpressions:
      - { key: node-role.kubernetes.io/control-plane, operator: Exists }
  concurrency: 1
  cordon: true
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade

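It cordons one node, upgrades it, and uncordons it, one node at a time (add a drain section to the Plan if you also want pods evicted first). Safe and boring. The worker plan is nearly identical: flip the nodeSelector to operator: DoesNotExist and add a prepare step so workers wait for the server plan to finish.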
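For reference, a worker plan along those lines might look like the sketch below, wrapped in a shell function so the version is passed in once. The function name and the plan name agent-plan are illustrative; the prepare step follows the pattern from the k3s automated-upgrades documentation and refers to the server plan by its name, server-plan.

```shell
# Sketch: apply a worker (agent) upgrade Plan for a given k3s version.
worker_plan() {
  version=$1
  kubectl apply -f - <<EOF
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  version: ${version}
  nodeSelector:
    matchExpressions:
      - { key: node-role.kubernetes.io/control-plane, operator: DoesNotExist }
  concurrency: 1
  cordon: true
  serviceAccountName: system-upgrade
  prepare:
    image: rancher/k3s-upgrade
    args: ["prepare", "server-plan"]
  upgrade:
    image: rancher/k3s-upgrade
EOF
}

# Example: worker_plan v1.30.3+k3s1
```

Raising concurrency on the worker plan is usually safe once you trust the process; keep it at 1 for the control plane.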

The pitfalls people learn the expensive way

  • Don’t use Calico with flannel without disabling flannel. k3s ships with flannel; if you install Calico over the top, the two CNIs fight and you get random connectivity loss.
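  • Don’t set the pod CIDR too small. The default 10.42.0.0/16 is 65k addresses, but it is carved into one /24 per node, so the real limits are about 256 nodes and roughly 250 pod IPs per node (the kubelet also caps pods at 110 per node by default). That is plenty at this guide’s scale; if you expect to outgrow it, set a wider range such as a /14 with --cluster-cidr at install time, because changing it later is disruptive.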
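  • Don’t trust kubectl drain blindly. Some pods (especially StatefulSets with hostPath volumes) don’t evict cleanly. Always check kubectl get pods -A -o wide after a drain. DaemonSet pods are expected to remain; anything else still Running on that node needs a look, and as a last resort a manual force-delete, before you reboot.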
  • Don’t put control-plane and worker workloads on the same nodes. k3s lets you, and it works fine until a runaway pod starves your apiserver of CPU or memory. Run dedicated control-plane nodes and taint them so app workloads stay off, for example --node-taint CriticalAddonsOnly=true:NoExecute on each server; without a taint, the scheduler will happily place pods there.
  • Don’t run k3s on btrfs. Containerd’s snapshotter doesn’t get along with btrfs CoW. ext4 or xfs.
  • Don’t expose the apiserver to the internet. Even with TLS and good RBAC, the apiserver gets scanned constantly. Use a VPN, a bastion, or a private network.
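The pod CIDR sizing in the list above comes down to two powers of two. The sketch below works it out for a given cluster prefix and per-node prefix; the function name is made up, and the /24 per-node subnet is the Kubernetes default that k3s inherits unless you change it.

```shell
# Sketch: how many per-node subnets fit in a cluster CIDR, and how big each is.
# Arguments are prefix lengths, e.g. 16 for 10.42.0.0/16 and 24 for a /24 per node.
cidr_capacity() {
  cluster_prefix=$1
  node_prefix=$2
  nodes=$(( 1 << (node_prefix - cluster_prefix) ))
  addrs=$(( 1 << (32 - node_prefix) ))
  echo "$nodes node subnets, $addrs addresses each"
}

cidr_capacity 16 24   # default k3s layout
cidr_capacity 14 24   # a wider /14 cluster CIDR
```

A few addresses in each node subnet are reserved, so the usable pod count per node is slightly below the raw figure.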

Putting it together — what a real prod deploy looks like

  1. 3 control-plane VMs (4 vCPU / 8 GB RAM / 50 GB SSD each)
  2. N worker VMs (size depends on workload; start with 4 vCPU / 16 GB)
  3. Longhorn for storage, replicaCount=2, S3 backup target
  4. ingress-nginx + cert-manager + Let’s Encrypt
  5. NetworkPolicies in default-deny per namespace
  6. RBAC: zero cluster-admin tokens in CI; per-app ServiceAccounts
  7. prometheus-stack with the 4 alerts above wired to PagerDuty/Slack
  8. etcd snapshots every 6h to S3, 5-day retention
  9. system-upgrade-controller for k3s and node OS upgrades
  10. A Slack channel where someone watches it. Actually, no: you set up the alerts so you don’t need to watch.

Total install time end-to-end: ~3 hours for the first time, ~30 minutes once you script it. Total nights of sleep saved over the next year: many.

Closing

k3s in production isn’t about k3s itself — it’s about the boring hardening checklist around it. Skip cert-manager and you’ll be paged in 90 days. Skip Longhorn and you’ll lose data on the first node failure. Skip RBAC and one leaked CI token compromises everything.

The good news: every item on this list is one Helm chart or one flag. There’s no week-long migration project hiding here. Just decide upfront, write it into your IaC, and forget about it.

Get the cluster healthy. Then forget you’re using k3s — that’s the goal.
