Short-Lived Certificates Done Right: Rotation, Automation, and Observability

Three AM. Your monitoring fires. An internal service is throwing TLS handshake errors. You ssh in, check the cert, and stare at the output: Not After: Apr 14 00:00:00 2025 GMT. It expired six weeks ago. Nobody noticed because the service wasn’t critical enough to monitor — until tonight, when it became a dependency in a new rollout.

This is the fate of manually managed long-lived certificates. Somebody issued them, wrote the expiry date in a spreadsheet, and then the spreadsheet quietly rotted. Long lifetimes are not a convenience — they’re a liability you’re deferring.

Short-lived certificates flip the model. A cert that expires in 24 hours forces you to automate renewal from day one, because you cannot babysit it manually. The attack surface from a stolen key shrinks to hours. Revocation — that historically broken mechanism — becomes almost irrelevant.

This article walks through building a practical short-lived certificate pipeline: a private CA with Vault PKI, automation with cert-manager inside Kubernetes (and a standalone path for VMs), and observability that pages you when the rotation machinery breaks, not when the cert is already dead.

Why Long-Lived Certs Are a Quiet Disaster

The industry default used to be two years. Then Let’s Encrypt normalized 90 days. In April 2025 the CA/Browser Forum voted to step the maximum lifetime of public certs down to 47 days by 2029. Each reduction forces better tooling.

The core problem isn’t the cert itself — it’s the human process around it. When a cert lasts two years, the mental model is "set it and forget it." Renewals happen under pressure, often manually, often late. Each manual step is a place where someone copies a key to their laptop, uploads it over SSH to five servers in a slightly different way, or forgets one of the seven places the cert is configured.

Short-lived certs — think 24h to 7 days for internal services — make the automation non-negotiable. The pipeline either works or the service breaks. That’s a feature, not a bug. It forces you to solve the hard problems upfront: where is the CA, who can request certs, how does renewal get triggered, and how do you know when the whole thing fails.

The Stack

Here’s what we’re building with:

HashiCorp Vault — PKI secrets engine as the internal CA. It’s battle-tested, has solid audit logging, and has first-class integrations everywhere. If you’re already running Vault for secrets, this is zero additional infrastructure.
cert-manager — the de facto Kubernetes controller for certificate lifecycle. Handles renewals, watches expiry, writes certs into Secrets.
step-ca — lightweight alternative for teams that don’t run Vault. ACME-compatible, trivial to self-host.
Prometheus + Alertmanager — for the observability layer.

For non-Kubernetes workloads, we’ll cover a standalone systemd timer approach using the Vault agent.

Setting Up Vault PKI

If you don’t have Vault running, the quickest path for a homelab or staging environment is Docker Compose:

# docker-compose.yml
services:
  vault:
    image: hashicorp/vault:1.17
    container_name: vault
    cap_add:
      - IPC_LOCK
    environment:
      VAULT_DEV_ROOT_TOKEN_ID: "root"   # dev mode only — replace with proper init in prod
      VAULT_DEV_LISTEN_ADDRESS: "0.0.0.0:8200"
    ports:
      - "8200:8200"
    # No volume: dev mode keeps everything in memory, so all data is lost on restart
    command: server -dev

In production you want HA with Raft storage, not dev mode. But for wiring up the PKI engine, the commands are identical.

Enable the PKI engine and configure the root CA:

# Enable PKI secrets engine
vault secrets enable pki

# Set max TTL to 10 years for the root CA
vault secrets tune -max-lease-ttl=87600h pki

# Generate internal root CA — keep this offline in production
vault write -field=certificate pki/root/generate/internal \
    common_name="Internal Root CA" \
    ttl=87600h > root_ca.crt

# Configure CRL and issuing endpoints
vault write pki/config/urls \
    issuing_certificates="http://vault:8200/v1/pki/ca" \
    crl_distribution_points="http://vault:8200/v1/pki/crl"

Now add an intermediate CA. You want a two-tier hierarchy — the root stays offline (or its key never leaves Vault’s HSM-backed storage), and the intermediate does the day-to-day signing:

# Enable a second mount for the intermediate
vault secrets enable -path=pki_int pki
vault secrets tune -max-lease-ttl=43800h pki_int

# Generate CSR for intermediate
vault write -format=json pki_int/intermediate/generate/internal \
    common_name="Internal Intermediate CA" \
    | jq -r '.data.csr' > pki_int.csr

# Sign with root
vault write -format=json pki/root/sign-intermediate \
    csr=@pki_int.csr \
    format=pem_bundle \
    ttl=43800h \
    | jq -r '.data.certificate' > intermediate.cert.pem

# Import signed cert back into intermediate mount
vault write pki_int/intermediate/set-signed \
    certificate=@intermediate.cert.pem
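
Before issuing anything from the intermediate, it can help to confirm that the signed certificate really chains to the root. A minimal sketch, assuming openssl is installed and that root_ca.crt and intermediate.cert.pem from the commands above are in the current directory; the function names are illustrative, not part of any tool:

```shell
#!/usr/bin/env bash
# verify_chain ROOT_PEM CERT_PEM
# Returns 0 if CERT_PEM validates with ROOT_PEM as the only trust anchor.
verify_chain() {
    local root=$1 cert=$2
    openssl verify -CAfile "$root" "$cert" >/dev/null 2>&1
}

# show_cert_summary CERT_PEM
# Prints the fields worth eyeballing: subject, signer, and expiry.
show_cert_summary() {
    openssl x509 -in "$1" -noout -subject -issuer -enddate
}

# Usage (after running the Vault commands above):
#   verify_chain root_ca.crt intermediate.cert.pem \
#       && show_cert_summary intermediate.cert.pem
```

If verify_chain fails here, every leaf certificate issued later will fail validation too, so this is a cheap place to catch a mis-signed intermediate.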

Create a role that will issue short-lived service certificates:

# Role for internal microservices — 24h max TTL
vault write pki_int/roles/internal-services \
    allowed_domains="internal.example.com,svc.cluster.local" \
    allow_subdomains=true \
    allow_bare_domains=false \
    max_ttl=24h \
    key_type=ec \
    key_bits=256 \
    require_cn=true

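# Policy that allows requesting certs under this role.
# cert-manager submits its own CSR to the sign endpoint; Vault Agent uses issue.
vault policy write cert-requester - <<EOF
path "pki_int/issue/internal-services" {
  capabilities = ["create", "update"]
}
path "pki_int/sign/internal-services" {
  capabilities = ["create", "update"]
}
EOF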

Gotcha: Don’t set allow_any_name=true unless you specifically need it. It turns your CA into a wildcard machine that will sign whatever garbage gets sent to it. Lock down the allowed_domains to exactly what you need.
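
To get a feel for what the role above will and will not sign, the two flags can be approximated locally. This is a rough sketch only: Vault's real validation has more rules (wildcards, globs, internationalized names), and the function name and variable are made up for illustration.

```shell
#!/usr/bin/env bash
# Mirrors allowed_domains from the internal-services role above.
ALLOWED_DOMAINS="internal.example.com svc.cluster.local"

# name_allowed NAME
# Returns 0 if NAME would plausibly pass the role's domain check.
name_allowed() {
    local name=$1 domain
    for domain in $ALLOWED_DOMAINS; do
        # allow_bare_domains=false: the domain itself is rejected
        [ "$name" = "$domain" ] && return 1
        # allow_subdomains=true: anything ending in ".<domain>" is accepted
        case "$name" in
            *".$domain") return 0 ;;
        esac
    done
    return 1
}

# Usage:
#   name_allowed api.internal.example.com && echo "would be signed"
#   name_allowed api.example.com          || echo "would be rejected"
```

Running a list of intended service names through a check like this before rollout tends to surface naming mistakes earlier than a failed CertificateRequest does.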

cert-manager for Kubernetes

cert-manager is the right answer for anything running in Kubernetes. Install it:

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.15.0/cert-manager.yaml

Configure a ClusterIssuer that talks to Vault:

# vault-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: vault-internal
spec:
  vault:
    server: http://vault.vault.svc.cluster.local:8200
    path: pki_int/sign/internal-services  # note: sign, not issue
    auth:
      kubernetes:
        mountPath: /v1/auth/kubernetes
        role: cert-manager
        secretRef:
          name: cert-manager-vault-token
          key: token

You need Vault’s Kubernetes auth method configured so cert-manager can authenticate:

# Enable kubernetes auth
vault auth enable kubernetes

# Configure it to talk to your cluster's API
vault write auth/kubernetes/config \
    kubernetes_host="https://kubernetes.default.svc:443" \
    kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

# Create a role that maps the cert-manager service account to the cert-requester policy
# (the path segment is "role", singular)
vault write auth/kubernetes/role/cert-manager \
    bound_service_account_names=cert-manager \
    bound_service_account_namespaces=cert-manager \
    policies=cert-requester \
    ttl=1h

Now issue a Certificate for a service. Note the duration and renewBefore — this is where you control the short-lived behavior:

# service-cert.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-service-tls
  namespace: production
spec:
  secretName: api-service-tls-secret
  issuerRef:
    name: vault-internal
    kind: ClusterIssuer
  duration: 24h        # cert lives 24 hours
  renewBefore: 4h      # cert-manager renews 4h before expiry
  commonName: api.internal.example.com  # needed because the Vault role sets require_cn=true
  dnsNames:
    - api.internal.example.com
    - api.production.svc.cluster.local
  privateKey:
    algorithm: ECDSA
    size: 256
    rotationPolicy: Always  # generate new key on every renewal

rotationPolicy: Always is important. cert-manager releases before v1.18, including the v1.15 installed above, default to Never and reuse the private key across renewals, which partially defeats the purpose of short rotation windows. A compromised key stays compromised until you explicitly force a new key.
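
To confirm that rotation is actually happening, one option is to read the live certificate out of the Secret and compare its remaining lifetime with renewBefore. A sketch, assuming kubectl, openssl, and GNU date are available; the Secret and namespace names come from the manifest above, and the function names are illustrative:

```shell
#!/usr/bin/env bash
# seconds_remaining: reads a PEM certificate on stdin and prints how many
# seconds are left until notAfter (negative if it has already expired).
seconds_remaining() {
    local not_after
    # openssl prints e.g. "notAfter=Apr 14 00:00:00 2025 GMT"
    not_after=$(openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
    [ -n "$not_after" ] || return 1
    echo $(( $(date -d "$not_after" +%s) - $(date +%s) ))
}

# renewal_overdue REMAINING_SECONDS RENEW_BEFORE_SECONDS
# Returns 0 when the cert is already inside the renewBefore window, meaning
# cert-manager should have replaced it by now.
renewal_overdue() {
    [ "$1" -lt "$2" ]
}

# Usage against the Secret above (renewBefore: 4h = 14400s):
#   left=$(kubectl -n production get secret api-service-tls-secret \
#            -o jsonpath='{.data.tls\.crt}' | base64 -d | seconds_remaining)
#   renewal_overdue "$left" 14400 && echo "renewal is overdue: ${left}s left"
```

On a healthy 24h/4h setup the remaining lifetime should move between roughly 24 hours and 4 hours and never drop far below that.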

The resulting Secret gets mounted into your pods. Services that don’t support dynamic cert reload need a rolling restart whenever the Secret changes. cert-manager does not do this itself (its cert-manager.io/inject-ca-from annotation only injects CA bundles into webhook and CRD configurations), so use a small operator like stakater/Reloader that watches the Secret for changes and restarts the Deployment automatically.

Bare Metal and VMs: Vault Agent Approach

Not everything runs in Kubernetes. For VMs and bare metal, the Vault Agent with a systemd timer is the practical path.

Install Vault Agent as a systemd service that runs on a schedule:

# /etc/systemd/system/vault-cert-renew.service
[Unit]
Description=Renew TLS certificate from Vault
After=network-online.target

[Service]
Type=oneshot
User=vault-agent
# The agent is a subcommand of the vault binary; there is no separate vault-agent executable
ExecStart=/usr/local/bin/vault agent -config=/etc/vault-agent/config.hcl
EnvironmentFile=/etc/vault-agent/env

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/vault-cert-renew.timer
[Unit]
Description=Run Vault cert renewal every 6 hours

[Timer]
OnBootSec=5min
OnCalendar=*-*-* 00/6:00:00
# Catches up missed runs after downtime. Persistent= only applies to OnCalendar=
# timers, and systemd does not allow a comment on the same line as a setting.
Persistent=true

[Install]
WantedBy=timers.target

The Vault Agent config does the actual work:

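# /etc/vault-agent/config.hcl
# Authenticate, render the templates once, then exit so the oneshot unit completes.
# Without this the agent keeps running as a long-lived daemon.
exit_after_auth = true

vault {
  address = "https://vault.internal.example.com:8200"
}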

# AppRole auth — suitable for VMs without Kubernetes
auto_auth {
  method "approle" {
    config = {
      role_id_file_path   = "/etc/vault-agent/role-id"
      secret_id_file_path = "/etc/vault-agent/secret-id"
      remove_secret_id_file_after_reading = false
    }
  }
}

# Write the cert and key to disk. Both must come from ONE issue call: two separate
# "secret" calls would mint two unrelated certificates, and the key on disk would
# not match the cert.
template {
  contents = <<EOT
{{- with secret "pki_int/issue/internal-services" "common_name=api.internal.example.com" "ttl=24h" -}}
{{ .Data.certificate }}
{{- .Data.private_key | writeToFile "/etc/ssl/service/key.pem" "vault-agent" "vault-agent" "0600" }}
{{- end }}
EOT
  destination = "/etc/ssl/service/cert.pem"
  # Reload on change. The vault-agent user needs permission to run this, and the
  # owner and group passed to writeToFile above are assumptions to adjust.
  command     = "systemctl reload nginx"
}
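
Because the certificate and the private key end up in files written at different moments, it is worth confirming that they belong together before a reload picks them up. A minimal sketch, assuming openssl is installed; the function name is illustrative:

```shell
#!/usr/bin/env bash
# cert_matches_key CERT_PEM KEY_PEM
# Returns 0 when the certificate's public key is the public half of the
# private key, by comparing the two PEM-encoded public keys.
cert_matches_key() {
    local from_cert from_key
    from_cert=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 1
    from_key=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 1
    [ -n "$from_cert" ] && [ "$from_cert" = "$from_key" ]
}

# Usage, e.g. as a guard in front of the reload command:
#   cert_matches_key "$CERT_FILE" "$KEY_FILE" && systemctl reload nginx
```

A mismatch usually means the two files came from different issue calls or that a render was interrupted halfway; nginx would refuse to load such a pair, so failing early keeps the old, still-valid pair in service.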

Gotcha: The secret-id for AppRole auth has its own TTL and use-count limits. If you set secret_id_num_uses=1, a retry loop can burn through it. For VM-based rotation, use secret_id_num_uses=0 (unlimited) but set a reasonable secret_id_ttl (e.g., 168h) and have your provisioning system rotate it periodically.
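
One way a provisioning job might decide when to rotate is to treat the secret-id file's modification time as its issue time. That is an assumption (it only holds if the file is rewritten exactly when a new secret-id is issued), and the function name and the AppRole role name in the usage comment are hypothetical. GNU stat is assumed.

```shell
#!/usr/bin/env bash
# secret_id_needs_rotation FILE TTL_SECONDS MARGIN_SECONDS
# Returns 0 when FILE is missing, or was last written at least
# (TTL - MARGIN) seconds ago and is therefore close to expiring.
secret_id_needs_rotation() {
    local file=$1 ttl=$2 margin=$3 written now
    [ -f "$file" ] || return 0
    written=$(stat -c %Y "$file") || return 0
    now=$(date +%s)
    [ $(( now - written )) -ge $(( ttl - margin )) ]
}

# Usage from a provisioning job holding a suitable Vault token
# (168h TTL = 604800s, rotate one day early; "vm-cert-renew" is a made-up role):
#   if secret_id_needs_rotation /etc/vault-agent/secret-id 604800 86400; then
#       vault write -f -field=secret_id \
#           auth/approle/role/vm-cert-renew/secret-id > /etc/vault-agent/secret-id.new \
#         && mv /etc/vault-agent/secret-id.new /etc/vault-agent/secret-id
#   fi
```

Writing to a temporary file and renaming it keeps the old secret-id in place if the Vault call fails.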

Observability: What to Monitor

The worst failure mode in automated certificate management isn’t a cert expiring — it’s the rotation machinery silently breaking. The cert was issued two months ago, renewal kept failing (maybe Vault was briefly unreachable, maybe the auth role expired), and nobody noticed because the current cert was still valid.

By the time you catch it, you have four hours left and a broken renewal pipeline to debug under pressure.

You need three layers of monitoring.

Layer 1: Certificate expiry itself

The third-party ssl_exporter for Prometheus or cert-manager’s built-in metrics expose expiry timestamps. This is your last-resort alert: if you’re paging on this, your automation already failed.

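# prometheus alerting rules
# With duration 24h and renewBefore 4h, a healthy cert always has between 4h and
# 24h left. Both thresholds therefore sit below renewBefore; anything higher
# would fire on healthy certs every day.
groups:
  - name: certificates
    rules:
      # Critical: under 1 hour left, renewal is 3h overdue and automation has failed
      - alert: CertificateExpiryCritical
        expr: |
          ssl_cert_not_after - time() < 1 * 3600
        for: 5m
        labels:
          severity: critical
        annotations:
          # $value is in seconds, which is what humanizeDuration expects
          summary: "Certificate expires in {{ $value | humanizeDuration }}"
          description: "{{ $labels.instance }} cert for {{ $labels.cn }} expires very soon"

      # Warning: under 3 hours left, renewal is 1h overdue and needs investigation
      - alert: CertificateExpiryWarning
        expr: |
          ssl_cert_not_after - time() < 3 * 3600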
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Certificate expiring soon: {{ $labels.cn }}"

Layer 2: cert-manager internal metrics

cert-manager exposes Prometheus metrics on port 9402. The key one is certmanager_certificate_ready_status:

      # cert-manager has a certificate stuck in non-Ready state
      - alert: CertManagerCertificateNotReady
        expr: |
          certmanager_certificate_ready_status{condition="False"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "cert-manager Certificate not ready"
          description: "Certificate {{ $labels.namespace }}/{{ $labels.name }} is not ready"

      # Renewal is overdue: the scheduled renewal time passed more than an hour
      # ago and the certificate is still approaching expiry
      - alert: CertManagerRenewalFailure
        expr: |
          certmanager_certificate_renewal_timestamp_seconds < (time() - 3600)
          and
          (certmanager_certificate_expiration_timestamp_seconds - time()) < 43200
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "cert-manager certificate renewal stalled"

Layer 3: Vault PKI health

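Monitor Vault’s own metrics for the PKI engine. The most useful signals are request error rates on the pki_int/issue and pki_int/sign paths, and CRL expiry. The rule below assumes a vault_pki_crl_expiry_timestamp gauge that holds the CRL’s nextUpdate time as a Unix timestamp; check that your Vault version or an external probe actually exports it before relying on the alert:

      # Vault PKI CRL is about to expire; clients that check it will start rejecting certificates.
      # No vault_secret_kv_count guard: that metric only counts KV mounts, so it would
      # never match pki_int and the alert could never fire.
      - alert: VaultPKICRLExpiry
        expr: |
          (vault_pki_crl_expiry_timestamp - time()) < 86400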
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Vault PKI CRL expires in under 24h"

Gotcha: The CRL expiry is a silent killer. If your CRL expires and clients are configured to check it (they should be), they’ll reject valid certificates. For short-lived certs this matters less — a 24h cert will expire before a 7-day CRL does — but it still affects intermediate and root certs. Automate CRL rotation or set a long CRL TTL (7 days is common) and alert hard at 2 days.
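
If nothing in your setup exports the CRL’s expiry yet, a small probe can produce a gauge such as vault_pki_crl_expiry_timestamp. A sketch, assuming curl, openssl, and GNU date, a VAULT_ADDR environment variable, and node_exporter’s textfile collector reading from a directory of your choosing; the function names and the output path are illustrative:

```shell
#!/usr/bin/env bash
# next_update_epoch: reads the output of `openssl crl -noout -nextupdate`
# (e.g. "nextUpdate=Apr 14 00:00:00 2025 GMT") on stdin, prints Unix time.
next_update_epoch() {
    local line
    IFS= read -r line || [ -n "$line" ] || return 1
    case "$line" in
        nextUpdate=*) date -u -d "${line#nextUpdate=}" +%s ;;
        *) return 1 ;;
    esac
}

# crl_metric MOUNT: fetches the PEM CRL of a PKI mount and prints one sample
# in the Prometheus text format.
crl_metric() {
    local mount=$1 epoch
    epoch=$(curl -fsS "$VAULT_ADDR/v1/$mount/crl/pem" \
              | openssl crl -noout -nextupdate | next_update_epoch) || return 1
    printf 'vault_pki_crl_expiry_timestamp{mount="%s"} %s\n' "$mount" "$epoch"
}

# Usage, e.g. from cron (the directory is an assumption about your setup):
#   crl_metric pki_int > /var/lib/node_exporter/textfile/vault_crl.prom
```

Because the probe reads the CRL the same way a client would, it also catches the case where Vault is up but the CRL endpoint is unreachable: the file simply stops being refreshed.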

Add a Grafana dashboard with four panels: cert expiry heatmap across all services, cert-manager reconcile queue depth, Vault PKI request success rate, and a table of certs sorted by time-to-expiry. The heatmap gives you the overview; the table tells you which service needs attention right now.

Production-Ready Patterns

Root CA key protection. In a real deployment, the root CA key should never touch Vault’s storage in plaintext. Use Vault Enterprise with a PKCS#11 HSM integration, or seal the root CA offline and only use it to sign intermediates, which you rotate annually. For homelab or small teams without HSM access, Vault’s auto-unseal with AWS KMS or GCP Cloud KMS is a reasonable middle ground — the key material is encrypted at rest and Vault can restart without manual intervention.

Separate intermediate CAs per environment. Don’t use the same intermediate CA for production, staging, and dev. If your dev environment gets compromised, the blast radius should stop at the intermediate. cert-manager can handle multiple ClusterIssuers pointing to different Vault mounts, so this is a configuration change, not a code change.

Don’t skip the SPIFFE/SVID path for service-to-service mTLS. If you’re issuing certificates for microservices that authenticate each other (mTLS), use SPIFFE URI SANs (spiffe://cluster.local/ns/production/sa/api-service) rather than just DNS names. This is what tools like SPIRE and Istio do natively. cert-manager supports it via the uris field in the Certificate spec; on the Vault side, the role must list the permitted patterns in allowed_uri_sans or the request is rejected. It makes authorization policy much cleaner: you’re attesting identity, not just hostname.

Health check endpoint for your rotation pipeline. Add a simple check that verifies the cert-rotation timer ran successfully in the last N hours. For systemd, this looks like:

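#!/bin/bash
# /usr/local/bin/check-cert-renewal
UNIT=vault-cert-renew.service
LAST_RUN=$(systemctl show "$UNIT" --property=ExecMainExitTimestamp --value)
EXIT_STATUS=$(systemctl show "$UNIT" --property=ExecMainStatus --value)

# Unset timestamp = never ran. GNU date would parse "" as midnight today, so check first.
if [ -z "$LAST_RUN" ] || [ "$LAST_RUN" = "n/a" ]; then
    echo "CRITICAL: $UNIT has never run"
    exit 2
fi
# A recent run only counts if it exited cleanly
if [ "$EXIT_STATUS" != "0" ]; then
    echo "CRITICAL: $UNIT last exited with status $EXIT_STATUS"
    exit 2
fi
LAST_RUN_EPOCH=$(date -d "$LAST_RUN" +%s 2>/dev/null || echo 0)
NOW=$(date +%s)
AGE=$(( NOW - LAST_RUN_EPOCH ))

# Alert if the service hasn't run successfully in 8 hours
if [ "$AGE" -gt 28800 ]; then
    echo "CRITICAL: vault-cert-renew last ran $((AGE / 3600)) hours ago"
    exit 2
fi
echo "OK: vault-cert-renew ran $((AGE / 60)) minutes ago"
exit 0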

Plug this into your Nagios/Icinga/Prometheus textfile collector. It’s a dead simple check that catches "systemd timer got disabled after a server rebuild" before it matters.
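
Nagios and Icinga consume the exit code and message directly, but node_exporter’s textfile collector expects the Prometheus text format in a .prom file. A sketch of an adapter; the metric names and function names are made up for illustration, and the output directory is an assumption about your node_exporter configuration:

```shell
#!/usr/bin/env bash
# renewal_metrics UNIT LAST_RUN_EPOCH EXIT_STATUS
# Prints two gauges in the Prometheus text exposition format.
renewal_metrics() {
    local unit=$1 last_run=$2 status=$3
    printf '# TYPE cert_renewal_last_run_timestamp_seconds gauge\n'
    printf 'cert_renewal_last_run_timestamp_seconds{unit="%s"} %s\n' "$unit" "$last_run"
    printf '# TYPE cert_renewal_last_exit_status gauge\n'
    printf 'cert_renewal_last_exit_status{unit="%s"} %s\n' "$unit" "$status"
}

# write_atomically DEST: copies stdin to DEST via a temp file and a rename,
# so the collector never reads a half-written file.
write_atomically() {
    local dest=$1 tmp
    tmp=$(mktemp "${dest}.XXXXXX") || return 1
    cat > "$tmp" && chmod 644 "$tmp" && mv "$tmp" "$dest"
}

# Usage (values would come from systemctl show, as in the check script):
#   renewal_metrics vault-cert-renew.service "$LAST_RUN_EPOCH" "$EXIT_STATUS" \
#       | write_atomically /var/lib/node_exporter/textfile/cert_renewal.prom
```

With the timestamp exported as a gauge, the 8-hour staleness rule can live in Prometheus as time() minus the metric, next to the other certificate alerts.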

The step-ca Alternative

If Vault feels like too much infrastructure for your scale, step-ca from Smallstep is worth a serious look. It’s a single Go binary, supports ACME protocol natively (so any ACME-compatible client — certbot, acme.sh, Caddy, Traefik — works against your private CA), and has first-class short-lived cert support.

The full setup is roughly:

# Initialize CA
step ca init --name "Internal CA" --dns "ca.internal.example.com" \
    --address ":443" --provisioner "[email protected]"

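# Add an ACME provisioner named "acme" (step ca init does not create one);
# the name is what appears in the /acme/acme/directory URL below
step ca provisioner add acme --type ACME

# Start it
step-ca $(step path)/config/ca.json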

# Issue a 24h cert via ACME
certbot certonly --server https://ca.internal.example.com/acme/acme/directory \
    --domains api.internal.example.com \
    --standalone \
    --cert-name api-internal

The main trade-off against Vault: step-ca has no secrets engine, no audit backend beyond logs, and no native Kubernetes auth. For pure PKI workloads it’s excellent. For teams already running Vault for application secrets, there’s no reason to add another CA.

The Operational Mindset Shift

Short-lived certificates change how you think about failures. The question is no longer "did this cert expire?" — the question is "is the automation healthy?" You’re monitoring a pipeline, not a static artifact.

This is the right mental model for any automation that manages infrastructure state. The cert expiry alert is the smoke alarm. You want the carbon monoxide detector that tells you the furnace is about to fail, not the one that fires when the house is already full of smoke.

Build the observability first. Wire up cert-manager metrics to Prometheus before you issue your first short-lived cert. Know what a healthy rotation cycle looks like in your dashboards. When something breaks — and it will break, usually during a Vault maintenance window or a misconfigured auth role — you’ll have the context to fix it in minutes instead of hours.

The infrastructure for all of this — Vault, cert-manager, Prometheus — is the same infrastructure you should be running anyway for secrets management, Kubernetes operations, and general observability. Short-lived certificate rotation is not a new system to operate; it’s a feature of the systems you already need.