Keycloak in Production: Clustering, Custom Themes, Federation, and Auth Flows That Actually Work

Most Keycloak tutorials show you a single-node docker run command and call it a day. That works fine for a Saturday afternoon demo. The moment you point real users at it, you find out that Keycloak’s distributed session cache has opinions, your LDAP tree doesn’t map cleanly to its user model, and the default login page looks like it was designed in 2009. This article covers the whole path from that toy setup to something you’d actually trust with SSO for 50,000 users.

Official repo: https://github.com/keycloak/keycloak


The Architecture You’re Aiming For

Before touching a config file, commit the mental model: Keycloak nodes are interchangeable application servers, but they are not quite stateless. Session and token state lives in an embedded Infinispan cache that is distributed across cluster members, so any node can serve any request once the cluster has formed. The database (Postgres, ideally) holds realm config, locally stored users and their attributes, and persistent sessions when that feature is enabled. Your load balancer sits in front; sticky routing is an optimization that keeps a login’s redirects on the node that owns the authentication session, not a requirement for the full session lifetime.

Get this wrong and you end up with intermittent 401s that reproduce only under load.


Database First, Always

Keycloak will happily start against H2. H2 is a trap. Its files live inside the container, so the first time the container is recreated your realm config is gone.

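Use Postgres. Pin the schema migration to a specific Keycloak version. Never let two cluster nodes run migrations simultaneously: during upgrades, start one node first and let it finish the migration before the others join, or run the migration as a separate job in an init container. To stop the remaining nodes from migrating on their own, set the migration strategy of the connections-jpa SPI to validate or manual on them (the exact option key differs between Keycloak versions, so check the upgrade guide for yours).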

# postgres.yml — run this before the Keycloak stack
services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: keycloak
      POSTGRES_USER: keycloak
      POSTGRES_PASSWORD: ${KC_DB_PASSWORD}
    volumes:
      - pg_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U keycloak"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - kc_internal

volumes:
  pg_data:

networks:
  kc_internal:
    external: true

Clustered Keycloak with Docker Compose

The following compose file runs two Keycloak nodes. Infinispan cluster discovery uses the JGroups JDBC_PING protocol, selected through the jdbc-ping cache stack: each node writes its member record to the Postgres database and reads its peers from there. This is reliable, requires zero multicast, and works across Docker networks, VMs, and cloud VPCs without special networking.

# keycloak-cluster.yml
services:
  kc1:
    image: quay.io/keycloak/keycloak:25.0.2
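    # Plain "start" re-runs the build step at boot, so build-time options such as KC_DB
    # and JARs mounted into providers/ are picked up. Use "start --optimized" only with
    # an image that was pre-built via "kc.sh build".
    command: start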
    environment:
      # Database
      KC_DB: postgres
      KC_DB_URL: jdbc:postgresql://postgres:5432/keycloak
      KC_DB_USERNAME: keycloak
      KC_DB_PASSWORD: ${KC_DB_PASSWORD}

      # Clustering: JDBC_PING uses the DB for member discovery, no multicast needed.
      # The built-in jdbc-ping stack ships with Keycloak 26.1+; on 25.x, point
      # KC_CACHE_CONFIG_FILE at a custom Infinispan XML that defines a JDBC_PING stack.
      # The JGROUPS_DISCOVERY_* variables belonged to the legacy WildFly image and are
      # not read by the Quarkus distribution, so they are omitted here.
      KC_CACHE: ispn
      KC_CACHE_STACK: jdbc-ping

      # Hostname — must be the public-facing URL, not the container hostname
      KC_HOSTNAME: https://auth.example.com
      KC_HOSTNAME_STRICT: "true"
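      KC_PROXY_HEADERS: xforwarded  # trust X-Forwarded-* from the load balancer (replaces the deprecated KC_PROXY: edge)
      KC_HTTP_ENABLED: "true"       # TLS is terminated at the load balancer, so nodes listen on plain HTTP 8080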

      # Admin credentials (first boot only; change via UI afterward)
      KEYCLOAK_ADMIN: ${KC_ADMIN_USER}
      KEYCLOAK_ADMIN_PASSWORD: ${KC_ADMIN_PASSWORD}

      # Logging
      KC_LOG_LEVEL: INFO
      KC_LOG: console

      # Performance tuning
      JAVA_OPTS_APPEND: >
        -Xms512m -Xmx1024m
        -XX:+UseG1GC
        -Djava.net.preferIPv4Stack=true

    volumes:
      - ./themes:/opt/keycloak/themes        # custom themes
      - ./providers:/opt/keycloak/providers  # custom SPIs / JARs
    networks:
      - kc_internal
      - kc_public
    depends_on:
      postgres:
        condition: service_healthy
    restart: unless-stopped

  kc2:
    # Identical to kc1 — Keycloak nodes are symmetric
    extends:
      service: kc1

  nginx:
    image: nginx:1.27-alpine
    volumes:
      - ./nginx/keycloak.conf:/etc/nginx/conf.d/default.conf:ro
      - /etc/letsencrypt:/etc/letsencrypt:ro
    ports:
      - "443:443"
      - "80:80"
    networks:
      - kc_public
    depends_on:
      - kc1
      - kc2
    restart: unless-stopped

networks:
  kc_internal:
    external: true
  kc_public:
    driver: bridge

Nginx upstream config:

# nginx/keycloak.conf
upstream keycloak {
    # ip_hash for initial login redirect stickiness only
    ip_hash;
    server kc1:8080 fail_timeout=10s max_fails=3;
    server kc2:8080 fail_timeout=10s max_fails=3;
}

server {
    listen 443 ssl;
    http2 on;  # nginx 1.25.1+ deprecates the "http2" parameter on listen
    server_name auth.example.com;

    ssl_certificate /etc/letsencrypt/live/auth.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/auth.example.com/privkey.pem;

    # Forward real client IP for Keycloak's audit logs
    proxy_set_header X-Real-IP         $remote_addr;
    proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header Host              $host;

    location / {
        proxy_pass http://keycloak;
        proxy_connect_timeout 10s;
        proxy_send_timeout    60s;
        proxy_read_timeout    60s;
    }
}

server {
    listen 80;
    server_name auth.example.com;
    return 301 https://$host$request_uri;
}

Gotcha: KC_HOSTNAME must exactly match the URL your users hit, including the scheme. If it doesn’t, Keycloak will issue tokens with the wrong iss claim and every app will reject them immediately. Also, the proxy setting (KC_PROXY_HEADERS=xforwarded on current releases, KC_PROXY=edge on older ones, where it has since been deprecated) tells Keycloak to trust X-Forwarded-* headers. Without it, Keycloak sees all traffic as plain HTTP coming from the proxy’s address and breaks redirects.
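
A quick way to catch a hostname mismatch before your apps do is to decode a freshly issued token and compare its iss claim with the URL you expect. The sketch below is a hypothetical diagnostic helper (IssuerCheck is not a Keycloak class). It assumes the usual https://host/realms/name issuer layout and deliberately skips signature verification, so it is for smoke tests only:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical smoke-test helper; not part of Keycloak.
final class IssuerCheck {

    // Matches a plain "iss":"..." pair; adequate for issuer URLs, which contain no quotes.
    private static final Pattern ISS = Pattern.compile("\"iss\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns the iss claim of a JWT WITHOUT verifying its signature (diagnostics only). */
    static String issuerOf(String jwt) {
        String[] parts = jwt.split("\\.");
        if (parts.length < 2) {
            throw new IllegalArgumentException("not a JWT");
        }
        // The payload is the second dot-separated segment, base64url-encoded.
        String payload = new String(Base64.getUrlDecoder().decode(parts[1]), StandardCharsets.UTF_8);
        Matcher m = ISS.matcher(payload);
        if (!m.find()) {
            throw new IllegalArgumentException("no iss claim");
        }
        return m.group(1);
    }

    /** True when the issuer is exactly <hostnameUrl>/realms/<realm> (assumed layout). */
    static boolean issuerMatches(String jwt, String hostnameUrl, String realm) {
        return issuerOf(jwt).equals(hostnameUrl + "/realms/" + realm);
    }
}
```

If issuerMatches returns false for a token fetched through the load balancer, the usual suspects are KC_HOSTNAME and the proxy header settings.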


Custom Themes

The default Keycloak UI screams "enterprise Java 2015." Your users shouldn’t have to suffer for it. Keycloak’s theme system is layered: you inherit from base or keycloak, override only what you need.

Theme structure for a login page:

themes/
└── my-company/
    ├── login/
    │   ├── theme.properties          # declares parent, imports
    │   ├── resources/
    │   │   ├── css/
    │   │   │   └── login.css
    │   │   └── img/
    │   │       └── logo.png
    │   └── templates/
    │       └── login.ftl             # overrides only the login form
    └── account/
        └── theme.properties          # use keycloak.v3 for the new account console

themes/my-company/login/theme.properties:

parent=keycloak
import=common/keycloak

# Replace the default stylesheet, keep everything else from parent
styles=css/login.css

# Variables available in Freemarker templates
kcBodyClass=my-company-login

themes/my-company/login/templates/login.ftl — start by copying from the Keycloak source, then gut the parts you don’t need. The critical Freemarker variables:

<#-- login.ftl — stripped down to essentials -->
<#import "template.ftl" as layout>
<@layout.registrationLayout displayMessage=!messagesPerField.existsError('username','password') displayInfo=realm.password && realm.registrationAllowed && !registrationDisabled??; section>
  <#if section = "header">
    <img src="${url.resourcesPath}/img/logo.png" alt="${realm.displayName}" class="logo"/>
  </#if>
  <#if section = "form">
    <form action="${url.loginAction}" method="post">
      <input type="hidden" id="id-hidden-input" name="credentialId"
             <#if auth.selectedCredential?has_content>value="${auth.selectedCredential}"</#if>/>
      <div class="form-group">
        <label for="username">${msg("usernameOrEmail")}</label>
        <input id="username" name="username" type="text" autofocus autocomplete="username"
               value="${(login.username!'')}"
               class="<#if messagesPerField.existsError('username','password')>error</#if>"/>
      </div>
      <div class="form-group">
        <label for="password">${msg("password")}</label>
        <input id="password" name="password" type="password" autocomplete="current-password"
               class="<#if messagesPerField.existsError('username','password')>error</#if>"/>
      </div>
      <input type="submit" value="${msg("doLogIn")}" class="btn-primary"/>
    </form>
  </#if>
</@layout.registrationLayout>

Assign the theme to your realm in Realm Settings → Themes → Login Theme.

Gotcha: Keycloak caches themes aggressively. During development, set KC_SPI_THEME_STATIC_MAX_AGE=-1, KC_SPI_THEME_CACHE_THEMES=false, and KC_SPI_THEME_CACHE_TEMPLATES=false. Forgetting to revert these in production is a performance tax you’ll pay on every page render.


User Federation with LDAP / Active Directory

"Just point it at LDAP" is how every federation story starts. Reality: your LDAP tree has structural quirks, group membership lives in memberOf attributes that need special handling, and service account permissions are rarely what Keycloak needs.

The cleanest path is a dedicated read-only bind account in LDAP with access scoped to exactly the OUs Keycloak needs. Never use a domain admin for this.

Key LDAP federation settings (via Admin UI or kcadm.sh):

# Create the LDAP federation using kcadm — reproducible in IaC
kcadm.sh create components -r your-realm \
  -s name="corporate-ldap" \
  -s providerId=ldap \
  -s providerType=org.keycloak.storage.UserStorageProvider \
  -s 'config.vendor=["ad"]' \
  -s 'config.connectionUrl=["ldap://dc1.corp.example.com:389"]' \
  -s 'config.bindDn=["CN=keycloak-svc,OU=ServiceAccounts,DC=corp,DC=example,DC=com"]' \
  -s "config.bindCredential=[\"${LDAP_BIND_PASSWORD}\"]" \
  -s 'config.usersDn=["OU=Users,DC=corp,DC=example,DC=com"]' \
  -s 'config.userObjectClasses=["person,organizationalPerson,user"]' \
  -s 'config.searchScope=["2"]' \
  -s 'config.useTruststoreSpi=["ldapsOnly"]' \
  -s 'config.connectionTimeout=["5000"]' \
  -s 'config.readTimeout=["10000"]' \
  -s 'config.pagination=["true"]' \
  -s 'config.batchSizeForSync=["500"]' \
  -s 'config.fullSyncPeriod=["604800"]' \
  -s 'config.changedSyncPeriod=["3600"]' \
  -s 'config.importEnabled=["true"]' \
  -s 'config.syncRegistrations=["false"]'

Group mapper: map AD group membership (read from each group’s member attribute) to Keycloak groups:

kcadm.sh create components -r your-realm \
  -s name="group-mapper" \
  -s providerId=group-ldap-mapper \
  -s providerType=org.keycloak.storage.ldap.mappers.LDAPStorageMapper \
  -s parentId=<ldap-component-id> \
  -s 'config.groups.dn=["OU=Groups,DC=corp,DC=example,DC=com"]' \
  -s 'config.group.object.classes=["group"]' \
  -s 'config.membership.attribute.type=["DN"]' \
  -s 'config.membership.ldap.attribute=["member"]' \
  -s 'config.membership.user.ldap.attribute=["distinguishedName"]' \
  -s 'config.mode=["READ_ONLY"]' \
  -s 'config.user.roles.retrieve.strategy=["LOAD_GROUPS_BY_MEMBER_ATTRIBUTE_RECURSIVELY"]' \
  -s 'config.drop.non.existing.groups.during.sync=["true"]'

Gotcha: AD’s member attribute only stores direct members. The recursive strategy (LOAD_GROUPS_BY_MEMBER_ATTRIBUTE_RECURSIVELY) makes AD walk the nesting chain on every lookup, which hits large directories hard. If that becomes a bottleneck, fall back to the non-recursive strategy and pre-flatten nested groups in an auxiliary OU. Also: if your AD uses referrals, set config.referral=["follow"]; otherwise Keycloak silently drops users from child domains.
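
To see why the recursive strategy is expensive, it helps to look at the kind of search it relies on. Active Directory resolves nested membership through the LDAP_MATCHING_RULE_IN_CHAIN extensible match (OID 1.2.840.113556.1.4.1941). The sketch below builds such a filter by hand; AdGroupFilter is a hypothetical helper, and the exact filter Keycloak sends is an assumption here, not something taken from its source:

```java
// Hypothetical illustration of an AD "in chain" group search filter.
final class AdGroupFilter {

    // OID of Active Directory's LDAP_MATCHING_RULE_IN_CHAIN extensible match.
    static final String IN_CHAIN = "1.2.840.113556.1.4.1941";

    /** Escapes a value for use inside an LDAP search filter (RFC 4515). */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\5c"); break;
                case '*':  sb.append("\\2a"); break;
                case '(':  sb.append("\\28"); break;
                case ')':  sb.append("\\29"); break;
                case '\0': sb.append("\\00"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Filter matching every group that contains userDn directly or through nesting. */
    static String nestedGroupsOf(String userDn) {
        return "(&(objectClass=group)(member:" + IN_CHAIN + ":=" + escape(userDn) + "))";
    }
}
```

AD has to evaluate the chain for every candidate group, which is why this gets slow on deep or wide group trees. Running the generated filter with ldapsearch against a test user is a cheap way to measure the cost before enabling the strategy.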

For social login (Google, GitHub, etc.), add identity providers under Identity Providers in the realm. The only non-obvious part is the First Login Flow setting: by default, when the incoming identity matches an existing local account, the user is asked to confirm the link and prove ownership of that account, by email verification or by re-authenticating. Override this with a custom first-broker-login flow if you want transparent account linking, and only for providers whose verified emails you trust.


Custom Authentication Flows

This is where Keycloak’s power really shows, and where most people give up. Authentication flows are trees of executions: authenticators and nested sub-flows, each marked REQUIRED, ALTERNATIVE, CONDITIONAL, or DISABLED. You can add MFA, risk-based auth, custom challenges, or completely replace the login sequence.

Real use case: require TOTP only for users who hold a specific role (grant the role through a group so that membership drives it), and skip it for internal IPs.

  1. Go to Authentication → Flows → browser → Copy (never edit the built-in flows, they cannot be reset without a full realm reset).
  2. Name it browser-conditional-otp.
  3. In the copied flow, find the Browser - Conditional OTP sub-flow. Set it to CONDITIONAL.
  4. Add two conditions:
    • Condition - User Configured — REQUIRED (ensures TOTP is only required if the user has it set up)
    • Condition - User Role — REQUIRED, configure it to check the realm role requires-2fa (enter realm roles by bare name; client roles use the client-id.role-name form)
  5. Bind this flow to the realm’s Browser Flow.
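
The steps above boil down to a small truth table: a CONDITIONAL sub-flow runs only when every REQUIRED condition inside it matches. The model below is a hypothetical illustration of that rule (ConditionalOtpModel is not Keycloak code), useful mostly for spotting the consequence that a role holder without an OTP credential is not challenged:

```java
// Hypothetical model of the CONDITIONAL sub-flow configured above.
final class ConditionalOtpModel {

    /**
     * The OTP form is shown only if both REQUIRED conditions match:
     * the user has an OTP credential AND holds the requires-2fa role.
     */
    static boolean otpPrompted(boolean hasOtpCredential, boolean hasRequires2faRole) {
        return hasOtpCredential && hasRequires2faRole;
    }
}
```

If role holders must always enroll, drop the User Configured condition or add a Configure OTP required action for them; which of the two fits is a policy decision this sketch does not make.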

For conditional IP-based bypass, you need a custom authenticator SPI. The scaffold:

// src/main/java/com/example/keycloak/ConditionalIpAuthenticator.java
package com.example.keycloak;

import org.keycloak.authentication.AuthenticationFlowContext;
import org.keycloak.authentication.Authenticator;
import org.keycloak.models.*;

public class ConditionalIpAuthenticator implements Authenticator {

    private static final String TRUSTED_CIDR_CONFIG = "trusted-cidr";

    @Override
    public void authenticate(AuthenticationFlowContext ctx) {
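        // Behind the proxy this is the forwarded client address, provided proxy headers are trusted
        String clientIp = ctx.getConnection().getRemoteAddr();
        // getAuthenticatorConfig() returns null until the execution is configured in the admin console
        AuthenticatorConfigModel cfg = ctx.getAuthenticatorConfig();
        String trustedCidr = (cfg == null || cfg.getConfig() == null)
            ? "" : cfg.getConfig().getOrDefault(TRUSTED_CIDR_CONFIG, "");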

        if (!trustedCidr.isBlank() && isInCidr(clientIp, trustedCidr)) {
            // Skip MFA for trusted networks — set a session note for downstream checks
            ctx.getAuthenticationSession().setAuthNote("ip-trusted", "true");
            ctx.success();
        } else {
            ctx.attempted(); // Let the flow continue to the next authenticator
        }
    }

    @Override public void action(AuthenticationFlowContext ctx) {}
    @Override public boolean requiresUser() { return false; }
    @Override public boolean configuredFor(KeycloakSession s, RealmModel r, UserModel u) { return true; }
    @Override public void setRequiredActions(KeycloakSession s, RealmModel r, UserModel u) {}
    @Override public void close() {}

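    private boolean isInCidr(String ip, String cidr) {
        // Real implementation: use Apache Commons Net SubnetUtils or similar
        // Never roll your own CIDR math in production
        // Note: SubnetUtils handles IPv4 only and throws IllegalArgumentException on malformed input
        org.apache.commons.net.util.SubnetUtils subnet = new org.apache.commons.net.util.SubnetUtils(cidr);
        subnet.setInclusiveHostCount(true); // count the network and broadcast addresses as in range
        return subnet.getInfo().isInRange(ip);
    }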
}

Build it as a JAR with the standard Keycloak SPI service file at META-INF/services/org.keycloak.authentication.AuthenticatorFactory, drop it into providers/, and restart the containers. The factory implementation wires it into the UI — see the Keycloak SPI docs for the boilerplate.

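Gotcha: Custom authenticator JARs must be compiled against the same Keycloak version you’re running. The internal APIs shift between major versions. Pin keycloak-core, keycloak-server-spi, and keycloak-server-spi-private (where the Authenticator interface lives) in your pom.xml to match, all with provided scope. Also: a server started with --optimized skips the build step, so it will not pick up a new JAR until you run docker exec kc1 /opt/keycloak/bin/kc.sh build and restart, or rebuild your image. Any third-party library the server does not already bundle has to be shaded into your JAR or dropped into providers/ next to it.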


Production Hardening Checklist

Session limits: Set SSO Session Max and SSO Session Idle in Realm Settings to sane values. The defaults (30 minutes idle, 10 hours max) are only a starting point, and the 10-hour max is still too long for many consumer-facing apps.

Brute force protection: Enable it. Set a lockout policy: 5 failures before lockout (failureFactor), 60-second wait increments (waitIncrementSeconds), and a 900-second maximum wait (maxFailureWaitSeconds). Do this in kcadm.sh so it’s in your IaC:

kcadm.sh update realms/your-realm \
  -s bruteForceProtected=true \
  -s failureFactor=5 \
  -s waitIncrementSeconds=60 \
  -s maxFailureWaitSeconds=900 \
  -s minimumQuickLoginWaitSeconds=60
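
To sanity-check what those numbers mean for a locked-out user, here is a small model of the wait calculation. It assumes the step-wise behavior where the wait grows by one increment each time the failure count reaches another multiple of failureFactor, capped at the maximum; that formula is my reading of the default strategy, so verify it against your Keycloak version. LockoutModel is a hypothetical name:

```java
// Hypothetical model of temporary lockout durations; not Keycloak code.
final class LockoutModel {

    /**
     * Assumed formula: waitIncrementSeconds * floor(failures / failureFactor),
     * capped at maxFailureWaitSeconds.
     */
    static int waitSeconds(int failures, int failureFactor,
                           int waitIncrementSeconds, int maxFailureWaitSeconds) {
        if (failureFactor <= 0) {
            throw new IllegalArgumentException("failureFactor must be positive");
        }
        int wait = waitIncrementSeconds * (failures / failureFactor); // integer division
        return Math.min(wait, maxFailureWaitSeconds);
    }
}
```

With the settings above, the fifth failure would cost 60 seconds, the tenth 120, and the cap of 900 seconds is reached at 75 failures.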

Token lifetimes: Access tokens should be short (5 minutes for APIs, 15 for web apps). Refresh tokens can be longer. Never issue access tokens that live for hours — that defeats the purpose of revocation.

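Audit logging: Keycloak emits login events, but by default it does not persist them; turn on Save Events in Realm Settings → Events and set an expiration, or they never reach the database. For long-term retention, export them to a log aggregator (Loki, Elasticsearch): set KC_LOG_CONSOLE_OUTPUT=json and ship stdout. Successful logins are logged at DEBUG, so raise the org.keycloak.events category (for example KC_LOG_LEVEL=INFO,org.keycloak.events:debug) or you will only see failures.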

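Metrics: Enable the metrics endpoint with KC_METRICS_ENABLED=true. Scrape /metrics with Prometheus. The critical metrics are the login and failed-login counters and the Infinispan cache hit rates. The names keycloak_logins_total and keycloak_failed_login_attempts_total come from the community keycloak-metrics-spi extension rather than the stock server, as far as I can tell, so check what your version actually exposes. A sudden spike in login failures without a corresponding spike in lockouts means someone is spraying credentials across many accounts and staying under the per-user lockout threshold.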

Gotcha: The /metrics and /health endpoints are served from the management port, 9000 by default, from Keycloak 25 onward (/health also needs KC_HEALTH_ENABLED=true). Don’t expose this port through your public load balancer. Bind it to an internal monitoring network only. I’ve seen installations that exposed admin metrics to the public internet and then wondered why their Grafana dashboard was public knowledge.


Realm Export and GitOps

Manual realm configuration through the UI doesn’t scale across environments. Export your realm:

# Export to a single file (includes users with their hashed credentials, plus client secrets: treat it as sensitive)
docker exec kc1 /opt/keycloak/bin/kc.sh export \
  --dir /tmp/realm-export \
  --realm your-realm \
  --users realm_file

docker cp kc1:/tmp/realm-export/your-realm-realm.json ./realm-exports/

Commit this JSON to git. On deploy, import it:

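# Import at startup: add --import-realm to the start command;
# it reads every .json file in /opt/keycloak/data/import
kc.sh start --import-realm

Mount the export file into /opt/keycloak/data/import and start with --import-realm (the Quarkus distribution has no KC_IMPORT variable; KEYCLOAK_IMPORT belonged to the legacy image). Keycloak will skip the import if the realm already exists, so restarts are safe. The flip side is that later edits to the JSON are not applied to an existing realm; push those with kcadm.sh or a config-sync tool.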

The real value here is diffing realm changes in PRs. Clients, scopes, federation configs, flow assignments — all reviewable in git. No more "who changed the token lifetime and when."
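
One caveat before committing: a CLI export can carry client secrets in clear text. A pre-commit step that masks them keeps the diffs reviewable without leaking credentials. The sketch below is a hypothetical, regex-based redactor (RealmExportRedactor is my name, not a Keycloak tool). It assumes secrets appear as "secret" string fields and does not touch user credential hashes, which need their own handling:

```java
import java.util.regex.Pattern;

// Hypothetical pre-commit helper; masks "secret" string fields in a realm export.
final class RealmExportRedactor {

    // Group 1 keeps the key and colon; the quoted value (with escaped chars) is replaced.
    private static final Pattern SECRET_FIELD =
        Pattern.compile("(\"secret\"\\s*:\\s*)\"(?:[^\"\\\\]|\\\\.)*\"");

    /** Returns the JSON text with every "secret" value replaced by "REDACTED". */
    static String redact(String realmJson) {
        return SECRET_FIELD.matcher(realmJson).replaceAll("$1\"REDACTED\"");
    }
}
```

A redacted file no longer round-trips the real secrets on import, so they have to be re-injected at deploy time from your secret store.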


Where Things Will Break

Two failure modes that hit almost everyone:

Clock skew. Keycloak signs tokens with a timestamp. If your application server’s clock is off from the Keycloak server’s clock by more than the skew your token validator tolerates (often somewhere between zero and a minute, depending on the library), token validation fails with Token is not active. Run chronyc tracking on every node. Use the same NTP source everywhere. This sounds obvious until it’s 3am and you’re staring at 401s from a cloud instance that drifted after a live migration.
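
The failure is easy to reproduce on paper. The sketch below models the time-window check a validator performs, with a configurable leeway; TokenTimeCheck is a hypothetical helper, and real libraries differ in which claims (nbf, iat, exp) they check and how much leeway they apply by default:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of a validator's time-window check; not a Keycloak class.
final class TokenTimeCheck {

    /**
     * A token is active when "now" (as seen by the validator) lies inside
     * [notBefore - leeway, expiresAt + leeway).
     */
    static boolean isActive(Instant notBefore, Instant expiresAt, Instant now, Duration leeway) {
        boolean started = !now.plus(leeway).isBefore(notBefore);
        boolean notExpired = now.minus(leeway).isBefore(expiresAt);
        return started && notExpired;
    }
}
```

A validator whose clock runs 45 seconds behind Keycloak rejects a brand-new token with 30 seconds of leeway and accepts it with 60. Fixing NTP is the real cure; widening the leeway only hides the drift.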

Infinispan split-brain. If a network partition between cluster nodes causes the cache to split, users will experience intermittent session loss. The JDBC_PING discovery helps but doesn’t prevent this entirely. Add a readiness probe that checks /health/ready (enable it with KC_HEALTH_ENABLED=true; on Keycloak 25+ it is served from the management port) and remove unhealthy nodes from the load balancer rotation automatically. Kubernetes handles this well; Docker Compose needs a separate health-check script or an external orchestrator.
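
For the Docker Compose case, the health-check script can be very small. The sketch below polls the readiness endpoint with the JDK’s HTTP client (Java 11+). ReadinessProbe and the base URL you pass in are assumptions. Whether the endpoint reflects a cache partition depends on the checks your Keycloak version registers, so treat this as one signal among several:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Pattern;

// Hypothetical readiness poller for a load-balancer health script.
final class ReadinessProbe {

    private static final Pattern DOWN = Pattern.compile("\"status\"\\s*:\\s*\"DOWN\"");

    /** Ready means HTTP 200 and no DOWN status anywhere in the JSON body. */
    static boolean isReady(int httpStatus, String body) {
        return httpStatus == 200 && body != null && !DOWN.matcher(body).find();
    }

    /** Fetches baseUrl + "/health/ready"; any network error counts as not ready. */
    static boolean probe(String baseUrl) {
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/health/ready"))
            .timeout(Duration.ofSeconds(3))
            .GET()
            .build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            return isReady(response.statusCode(), response.body());
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

A wrapper script would call probe once per node (for example with a base URL such as http://kc1:9000, assuming the default management port) and rewrite the nginx upstream or mark the server down when it returns false.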

Both of these are "works fine in staging, fails in production" problems because staging usually runs single-node and on a single host. Test your HA setup under actual load before go-live.


Keycloak has a reputation for being complex, and honestly it earns that reputation. But once it’s running properly — clustered, themed, federated, with flows that match your actual security requirements — it’s remarkably solid. The configuration surface is large because identity is genuinely complex. The answer isn’t to reach for a SaaS IdP that hides the complexity; it’s to understand each layer and configure it deliberately.
