Self-Hosting Kafka the Right Way: KRaft Mode, Cluster Sizing, and JMX Observability

Running Kafka in production without ZooKeeper used to be a fantasy. For years, every Kafka deployment dragged along a ZooKeeper ensemble: three more nodes, three more failure domains, three more things to wake you up at 2 AM. Kafka 3.3 marked KRaft production-ready, Kafka 3.5 deprecated ZooKeeper mode, and Kafka 4.0 removed it entirely. If you’re still managing a ZK quorum in 2026, you’re carrying technical debt that will cost you.

This guide walks through a complete self-hosted Kafka setup: KRaft mode from scratch, sizing your cluster based on actual workload math, and wiring JMX metrics into Prometheus so you can see what’s happening before something breaks. No managed service upsell. No hand-waving about "it depends". Real configuration you can adapt and deploy today.

Official GitHub: github.com/apache/kafka


Why KRaft Changes Everything

The old architecture forced a hard coupling between Kafka and ZooKeeper. Metadata (topic configs, partition assignments, ISR state) lived in ZK, meaning every leader election had to touch an external system. This created a ceiling: large clusters with hundreds of thousands of partitions hit ZK coordination limits and became genuinely painful to manage.

KRaft replaces ZooKeeper with a Raft-based metadata quorum built directly into Kafka itself. A subset of brokers (or dedicated controller nodes) participate in the quorum. Metadata is stored in an internal __cluster_metadata topic. The result is faster controller failover (seconds instead of tens of seconds), support for millions of partitions, and one fewer operational dependency.
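To make the quorum idea concrete, here is a minimal sketch of the majority arithmetic any Raft-style quorum relies on. The function name is made up for illustration; only the math is standard: a quorum stays available as long as a strict majority of voters is alive.

```python
def quorum_tolerance(voters: int) -> dict:
    """Majority size and tolerated failures for a Raft-style quorum of `voters` nodes."""
    if voters < 1:
        raise ValueError("a quorum needs at least one voter")
    majority = voters // 2 + 1
    return {
        "voters": voters,
        "majority": majority,
        "tolerated_failures": voters - majority,
    }


for n in (1, 3, 4, 5):
    print(quorum_tolerance(n))
```

Note that four voters tolerate no more failures than three, which is why controller quorums are normally sized at odd numbers.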

There’s a mode distinction worth knowing up front: a Kafka node can run as broker, controller, or combined (both), depending on its process.roles setting. For small clusters (under 5 nodes), combined mode is fine. For anything production-serious, run dedicated controller nodes. More on this in the sizing section.


Cluster Sizing: The Numbers That Actually Matter

Before touching a config file, answer three questions:

1. What’s your peak throughput?
Kafka is I/O-bound. A single broker on decent hardware (8 cores, NVMe, 10 GbE) can handle 200–400 MB/s sustained write throughput. Replication multiplies that: with a replication factor of 3, every produced byte is written to disk three times across the cluster, and the leader also sends it over the network to two followers.

2. What’s your retention window?
Kafka stores data on disk. Retention window × average write rate × replication factor = your minimum cluster-wide disk requirement; divide by the broker count for the per-node figure. Add 30% headroom. Don’t forget that compacted topics retain data differently: they don’t expire by time, they keep only the latest value per key.

3. How many partitions do you need?
Partitions are the unit of parallelism. More consumer instances → more partitions needed. Rule of thumb: max(desired_consumer_parallelism, producer_concurrency). Don’t over-partition: each partition replica costs open file handles and memory-mapped index files on the broker that hosts it (budget roughly 1 MB per replica as a rule of thumb). A cluster with 10,000 partitions is manageable. 500,000 is not unless you’ve done your homework.
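The three questions above reduce to a few lines of arithmetic. The sketch below is a rough estimator, not a capacity-planning tool: the function names are invented for this example, the 200 MB/s per-broker figure is the low end of the range quoted above, and it assumes that figure has to cover replicated writes as well.

```python
import math


def size_cluster(avg_write_mb_s, retention_hours, replication_factor=3,
                 per_broker_write_mb_s=200.0, headroom=0.30):
    """Rough broker count and disk estimate from workload numbers (all assumptions)."""
    # Every produced byte is written once per replica somewhere in the cluster.
    replicated_mb_s = avg_write_mb_s * replication_factor
    # Need enough brokers for the write load, and at least one per replica.
    brokers = max(math.ceil(replicated_mb_s / per_broker_write_mb_s), replication_factor)
    # MB/s * seconds retained, converted to decimal TB.
    retained_tb = replicated_mb_s * retention_hours * 3600 / 1_000_000
    total_tb = retained_tb * (1 + headroom)
    return {
        "brokers": brokers,
        "total_disk_tb": total_tb,
        "disk_per_broker_tb": total_tb / brokers,
    }


def suggested_partitions(consumer_parallelism, producer_concurrency):
    """The rule of thumb from the text: the larger of the two parallelism figures."""
    return max(consumer_parallelism, producer_concurrency)


estimate = size_cluster(avg_write_mb_s=10, retention_hours=168)
print(estimate["brokers"], round(estimate["disk_per_broker_tb"], 2))
print(suggested_partitions(consumer_parallelism=12, producer_concurrency=4))
```

Even a modest 10 MB/s average with 7-day retention and RF=3 lands in the tens of terabytes cluster-wide, which is why retention tends to drive disk sizing more than throughput does.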

Sizing Templates

Cluster Size | Workload                   | Nodes                      | vCPU/node | RAM/node | Disk/node
Small        | < 50 MB/s, dev/staging     | 3 combined                 | 4         | 8 GB     | 500 GB NVMe
Medium       | 50–200 MB/s, prod          | 3 brokers + 3 controllers  | 8         | 16 GB    | 2 TB NVMe
Large        | 200 MB/s+, high-throughput | 6+ brokers + 3 controllers | 16        | 32 GB    | 4 TB NVMe RAID

For the medium and large tiers, use dedicated controller nodes. They’re lightweight — a controller quorum handles only metadata, not data path traffic. A 4 vCPU / 8 GB controller node is sufficient.

JVM heap sizing: Set -Xms and -Xmx equal to prevent heap resizing. For brokers, 6 GB heap is a reasonable starting point. Don’t go above 8–10 GB — large heaps cause long GC pauses. Kafka relies heavily on the OS page cache, so leave the rest of RAM for the OS to cache disk I/O.
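As a quick sanity check on the heap-versus-page-cache trade-off, the sketch below estimates how much RAM is left for the page cache and how many seconds of fresh writes that cache can hold. The 1 GB OS reserve and both function names are assumptions made for this example.

```python
def memory_split(ram_gb: float, heap_gb: float, os_reserve_gb: float = 1.0) -> float:
    """RAM (GB) left for the OS page cache after the JVM heap and a small OS reserve."""
    page_cache_gb = ram_gb - heap_gb - os_reserve_gb
    if page_cache_gb <= 0:
        raise ValueError("heap plus OS reserve leaves nothing for the page cache")
    return page_cache_gb


def cache_window_seconds(page_cache_gb: float, write_mb_s: float) -> float:
    """Seconds of freshly written data the page cache can hold at a given write rate."""
    return page_cache_gb * 1024 / write_mb_s


cache_gb = memory_split(ram_gb=16, heap_gb=6)
print(cache_gb, cache_window_seconds(cache_gb, write_mb_s=100))
```

Consumers that stay within that window of the log tail are served from memory rather than disk, which is the practical reason to keep the heap modest.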


Setting Up KRaft: Docker Compose That Actually Works

Let’s build a 3-node combined KRaft cluster with Docker Compose. Each node acts as both broker and controller — suitable for staging or smaller production workloads.

First, generate a cluster UUID. Every KRaft cluster needs one:

docker run --rm apache/kafka:3.9.0 \
  /opt/kafka/bin/kafka-storage.sh random-uuid

Save that UUID. You’ll need it for every node’s storage format command and for the cluster ID environment variable (the apache/kafka image reads it from CLUSTER_ID, without the KAFKA_ prefix).
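If you’d rather generate the ID without pulling the image, Kafka’s cluster IDs are 16 random bytes encoded as URL-safe base64 without padding, which gives 22 characters. The sketch below mimics that format; both helper names are invented here, and skipping IDs with a leading dash is a convenience for CLI use that I believe matches the built-in tool’s behavior, so treat that detail as an assumption.

```python
import base64
import uuid


def random_cluster_id() -> str:
    """Generate a 22-character URL-safe base64 ID in the same shape Kafka uses."""
    while True:
        raw = uuid.uuid4().bytes  # 16 random bytes
        candidate = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
        # A leading dash is awkward as a command-line argument, so retry.
        if not candidate.startswith("-"):
            return candidate


def looks_like_cluster_id(value: str) -> bool:
    """Cheap shape check: 22 URL-safe base64 characters decoding to 16 bytes."""
    if len(value) != 22:
        return False
    try:
        return len(base64.urlsafe_b64decode(value + "==")) == 16
    except ValueError:
        return False


print(random_cluster_id())
```

The shape check only catches gross mistakes such as a truncated paste; it cannot tell a real ID from any other 22-character base64 string.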

docker-compose.yml

version: "3.9"

# Service-level settings shared by all nodes
x-kafka-common: &kafka-common
  image: apache/kafka:3.9.0
  restart: unless-stopped
  networks:
    - kafka-net

# Environment shared by all nodes; per-node values are set under each service
x-kafka-env: &kafka-env
  # --- KRaft identity ---
  KAFKA_PROCESS_ROLES: broker,controller     # combined mode
  # The apache/kafka image reads the cluster ID from CLUSTER_ID (no KAFKA_ prefix)
  CLUSTER_ID: "REPLACE_WITH_YOUR_UUID"       # same on all nodes

  # --- Listeners ---
  # CONTROLLER: internal Raft traffic
  # PLAINTEXT: internal broker-to-broker replication
  # EXTERNAL: client-facing (container port 9092, mapped to a host port per node)
  KAFKA_LISTENERS: CONTROLLER://0.0.0.0:9093,PLAINTEXT://0.0.0.0:29092,EXTERNAL://0.0.0.0:9092
  KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,EXTERNAL:PLAINTEXT
  KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
  KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT

  # --- KRaft quorum: all 3 nodes participate as controllers ---
  KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka1:9093,2@kafka2:9093,3@kafka3:9093

  # --- Storage: point log.dirs at the mounted volume ---
  KAFKA_LOG_DIRS: /var/lib/kafka/data

  # --- Replication defaults ---
  KAFKA_DEFAULT_REPLICATION_FACTOR: 3
  KAFKA_MIN_INSYNC_REPLICAS: 2
  KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
  KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 3
  KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 2

  # --- Performance tuning ---
  KAFKA_NUM_PARTITIONS: 6                    # default for auto-created topics
  KAFKA_LOG_RETENTION_HOURS: 168             # 7 days
  KAFKA_LOG_SEGMENT_BYTES: 1073741824        # 1 GB segments
  KAFKA_LOG_RETENTION_CHECK_INTERVAL_MS: 300000

  # --- JVM / GC ---
  KAFKA_HEAP_OPTS: "-Xms4g -Xmx4g"

  # --- JMX ---
  KAFKA_JMX_PORT: 9999
  KAFKA_OPTS: "-Dcom.sun.jndi.rmiregistry.factory.registrySocket.timeout=5000"

networks:
  kafka-net:
    driver: bridge

volumes:
  kafka1-data:
  kafka2-data:
  kafka3-data:

services:
  kafka1:
    <<: *kafka-common
    hostname: kafka1
    container_name: kafka1
    ports:
      - "9092:9092"   # external client listener
      - "9999:9999"   # raw JMX (RMI) port, for tools like JConsole
    volumes:
      - kafka1-data:/var/lib/kafka/data
    environment:
      <<: *kafka-env
      KAFKA_NODE_ID: 1
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka1:29092,EXTERNAL://YOUR_HOST_IP:9092
      KAFKA_JMX_HOSTNAME: kafka1                 # must match container hostname for remote access

  kafka2:
    <<: *kafka-common
    hostname: kafka2
    container_name: kafka2
    ports:
      - "9093:9092"
      - "10000:9999"
    volumes:
      - kafka2-data:/var/lib/kafka/data
    environment:
      <<: *kafka-env
      KAFKA_NODE_ID: 2
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka2:29092,EXTERNAL://YOUR_HOST_IP:9093
      KAFKA_JMX_HOSTNAME: kafka2

  kafka3:
    <<: *kafka-common
    hostname: kafka3
    container_name: kafka3
    ports:
      - "9094:9092"
      - "10001:9999"
    volumes:
      - kafka3-data:/var/lib/kafka/data
    environment:
      <<: *kafka-env
      KAFKA_NODE_ID: 3
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka3:29092,EXTERNAL://YOUR_HOST_IP:9094
      KAFKA_JMX_HOSTNAME: kafka3

Start it:

docker compose up -d
# Verify all nodes see each other
# Each broker prints a header line that includes its id
docker exec kafka1 /opt/kafka/bin/kafka-broker-api-versions.sh \
  --bootstrap-server kafka1:29092 | grep 'id:'
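A typo in the quorum voters string is one of the easier ways to end up with a cluster that never elects a controller. The sketch below parses the id@host:port format used by KAFKA_CONTROLLER_QUORUM_VOTERS in the Compose file and rejects malformed or duplicate entries. The function name is hypothetical, and the checks cover only the format, not whether the hosts are reachable.

```python
def parse_quorum_voters(value: str) -> dict[int, tuple[str, int]]:
    """Parse 'id@host:port,id@host:port,...' into {node_id: (host, port)}."""
    voters: dict[int, tuple[str, int]] = {}
    for entry in value.split(","):
        entry = entry.strip()
        node_id, _, endpoint = entry.partition("@")
        host, _, port = endpoint.rpartition(":")
        if not node_id.isdigit() or not host or not port.isdigit():
            raise ValueError(f"malformed voter entry: {entry!r}")
        if int(node_id) in voters:
            raise ValueError(f"duplicate node id: {node_id}")
        voters[int(node_id)] = (host, int(port))
    return voters


print(parse_quorum_voters("1@kafka1:9093,2@kafka2:9093,3@kafka3:9093"))
```

Comparing the parsed IDs against each service’s KAFKA_NODE_ID is a cheap pre-flight check before the first boot.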

Gotcha: Storage Format Must Be Done Before First Boot

KRaft requires formatting the storage directory with your cluster ID before the broker can start. The official Docker image formats storage automatically on startup, using the cluster ID from its CLUSTER_ID environment variable. But if you’re running bare-metal Kafka from a tarball, you must run this manually on every node:

./bin/kafka-storage.sh format \
  --config config/kraft/server.properties \
  --cluster-id "YOUR_UUID_HERE"

Skip this step and Kafka refuses to start. No helpful error message, just a cryptic log about a missing meta.properties. Been there.
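A small pre-flight check can catch an unformatted or mismatched directory before Kafka does. The sketch below reads meta.properties from a log directory and compares its cluster.id with the ID you expect. The helper names are made up for this example, and it assumes the file uses plain key=value lines.

```python
from pathlib import Path


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def storage_is_formatted(log_dir: str, expected_cluster_id: str) -> bool:
    """True if log_dir holds a meta.properties whose cluster.id matches."""
    meta = Path(log_dir) / "meta.properties"
    if not meta.is_file():
        return False
    return parse_properties(meta.read_text()).get("cluster.id") == expected_cluster_id


print(parse_properties("version=1\ncluster.id=4L6g3nShT-eMCtK--X86sw\nnode.id=1\n"))
```

Run it against every directory listed in log.dirs; a node formatted with the wrong cluster ID fails the comparison the same way an unformatted one does.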


JMX Observability: From Raw Metrics to Prometheus

JMX is Kafka’s native metrics system, and it’s both powerful and annoying. Powerful because every critical broker metric is there: throughput, request latencies, ISR shrinks, log end offsets. Annoying because the default JMX interface requires Java tooling to consume. Nobody wants jconsole.

The standard approach: run jmx_exporter as a Java agent inside the Kafka JVM. It reads the JMX MBeans and exposes them on an HTTP port in Prometheus format. One config file, one extra JVM flag.

JMX Exporter Config

Save this as jmx-exporter/kafka-config.yml:

# jmx_exporter config for Apache Kafka 3.x KRaft
# Collects the metrics that actually matter in production

lowercaseOutputName: true
lowercaseOutputLabelNames: true

rules:
  # ---- Broker throughput (bytes in/out per second) ----
  - pattern: "kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec|MessagesInPerSec), topic=(.+)><>Count"
    name: kafka_server_brokertopicmetrics_$1_total
    labels:
      topic: "$2"
    type: COUNTER

  # ---- Request handler idle ratio — below 20% means broker is CPU-stressed ----
  - pattern: "kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate"
    name: kafka_server_request_handler_avg_idle_percent
    type: GAUGE

  # ---- Under-replicated partitions — should ALWAYS be 0 in steady state ----
  - pattern: "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value"
    name: kafka_server_replica_manager_under_replicated_partitions
    type: GAUGE

  # ---- ISR shrinks/expands — frequent shrinks indicate network or GC pressure ----
  - pattern: "kafka.server<type=ReplicaManager, name=(IsrShrinksPerSec|IsrExpandsPerSec)><>OneMinuteRate"
    name: kafka_server_replica_manager_$1
    type: GAUGE

  # ---- Active controller count — exactly 1 is healthy, 0 means election in progress ----
  - pattern: "kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value"
    name: kafka_controller_active_controller_count
    type: GAUGE

  # ---- Log end offsets per partition — useful for lag calculation ----
  - pattern: "kafka.log<type=Log, name=LogEndOffset, topic=(.+), partition=(.+)><>Value"
    name: kafka_log_log_end_offset
    labels:
      topic: "$1"
      partition: "$2"
    type: GAUGE

  # ---- Network processor idle ratio — below 30% is a warning sign ----
  - pattern: "kafka.network<type=SocketServer, name=NetworkProcessorAvgIdlePercent><>Value"
    name: kafka_network_processor_avg_idle_percent
    type: GAUGE

  # ---- Purgatory size (delayed produce/fetch operations waiting) ----
  - pattern: "kafka.server<type=DelayedOperationPurgatory, name=PurgatorySize, delayedOperation=(.+)><>Value"
    name: kafka_server_delayed_operation_purgatory_size
    labels:
      delayed_operation: "$1"
    type: GAUGE

  # ---- JVM GC ----
  - pattern: "java.lang<type=GarbageCollector, name=(.+)><>(CollectionCount|CollectionTime)"
    name: jvm_gc_$2_total
    labels:
      gc: "$1"
    type: COUNTER

  # ---- JVM memory ----
  - pattern: "java.lang<type=Memory><HeapMemoryUsage>used"
    name: jvm_heap_memory_used_bytes
    type: GAUGE

Wiring JMX Exporter into the Container

Download the agent JAR:

mkdir -p jmx-exporter
curl -L https://github.com/prometheus/jmx_exporter/releases/download/1.1.0/jmx_prometheus_javaagent-1.1.0.jar \
  -o jmx-exporter/jmx_prometheus_javaagent.jar

Add to your Docker Compose volumes and environment:

# Add to each kafka service:
volumes:
  - ./jmx-exporter:/opt/jmx-exporter:ro
  # ... existing volumes

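environment:
  # Replace the KAFKA_OPTS line with:
  KAFKA_OPTS: >-
    -javaagent:/opt/jmx-exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx-exporter/kafka-config.yml
    -Dcom.sun.jndi.rmiregistry.factory.registrySocket.timeout=5000
  # docker exec inherits KAFKA_OPTS, so Kafka CLI tools run inside the container would
  # also load the agent and fail to bind 7071; run them as: docker exec -e KAFKA_OPTS= <container> ...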

ports:
  - "7071:7071"  # Prometheus scrape port (adjust per node: 7071, 7072, 7073)

After restarting, verify:

curl -s http://localhost:7071/metrics | grep kafka_server_replica_manager_under_replicated
# Should return: kafka_server_replica_manager_under_replicated_partitions 0

Zero under-replicated partitions. That’s your baseline. Alert on anything above zero.


Prometheus Scrape Config

# prometheus.yml scrape section
scrape_configs:
  - job_name: "kafka"
    static_configs:
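      - targets:
          # Inside the Compose network every exporter listens on container port 7071;
          # the per-node host ports (7071, 7072, 7073) only apply when scraping via the host
          - "kafka1:7071"
          - "kafka2:7071"
          - "kafka3:7071"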
        labels:
          cluster: "prod-kafka"
    metrics_path: /metrics
    scrape_interval: 15s

The Five Alerts You Must Have

Set these up before anything goes to production. If you skip alerts and only look at dashboards, you’ll find out about problems from your users. Each snippet below is one entry under rules: in a Prometheus rule group.

1. Under-replicated partitions

- alert: KafkaUnderReplicatedPartitions
  expr: kafka_server_replica_manager_under_replicated_partitions > 0
  for: 2m
  labels:
    severity: critical

2. No active controller

- alert: KafkaNoActiveController
  expr: sum(kafka_controller_active_controller_count) != 1
  for: 1m
  labels:
    severity: critical

3. Request handler saturation

- alert: KafkaRequestHandlerSaturated
  expr: kafka_server_request_handler_avg_idle_percent < 0.2
  for: 5m
  labels:
    severity: warning

4. ISR shrinks spiking

# The exporter config lowercases metric names and already exports the one-minute rate
# as a gauge, so compare it directly instead of wrapping it in rate()
- alert: KafkaIsrShrinkRate
  expr: kafka_server_replica_manager_isrshrinkspersec > 0.1
  for: 5m
  labels:
    severity: warning

5. Consumer group lag (use KMinion or kafka-consumer-groups.sh; broker JMX doesn’t expose consumer lag directly, which is a common point of confusion)

For consumer lag, add KMinion to your stack. It reads consumer group offsets via the Kafka API and exposes them for Prometheus, which is lighter than polling kafka-consumer-groups.sh on a schedule.
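Whatever tool reports it, consumer lag is just the gap between a partition’s log end offset and the group’s committed offset. The sketch below shows that calculation on plain dictionaries; the function names and the (topic, partition) key shape are choices made for this example, not any exporter’s API.

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag: log end offset minus committed offset.

    Partitions with no committed offset map to None, because the real lag then
    depends on where the consumer will start reading.
    """
    lag = {}
    for partition, end in end_offsets.items():
        done = committed.get(partition)
        lag[partition] = None if done is None else max(end - done, 0)
    return lag


def total_lag(lag: dict) -> int:
    """Sum of the known per-partition lags."""
    return sum(value for value in lag.values() if value is not None)


end_offsets = {("orders", 0): 100, ("orders", 1): 50, ("orders", 2): 7}
committed = {("orders", 0): 90, ("orders", 1): 50}
lag = consumer_lag(end_offsets, committed)
print(lag, total_lag(lag))
```

Alert on lag that keeps growing over several minutes rather than on a fixed threshold, since a healthy consumer on a busy topic always shows some lag.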


Gotcha: KAFKA_JMX_HOSTNAME Must Be Resolvable

Remote JMX (the raw RMI protocol, not the exporter) breaks silently if KAFKA_JMX_HOSTNAME doesn’t resolve from the client machine. With Docker, set it to the container hostname, not localhost. The JMX agent communicates the hostname back to the client during the handshake — a mismatch causes a connection timeout that looks like a firewall issue but isn’t.

If you’re exposing raw JMX outside Docker for tools like JConsole, you’ll also need to set KAFKA_JMX_OPTS explicitly to disable authentication (-Dcom.sun.management.jmxremote.authenticate=false) — but only do this on isolated networks. Raw JMX with no auth is a security hole.
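Before blaming the firewall, it is worth checking from the client machine whether the advertised JMX hostname resolves at all. Here is a minimal sketch using the standard library resolver; the function name is invented, and the injectable resolver argument exists only so the check can be tested without real DNS.

```python
import socket


def resolves(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """True if the hostname resolves to at least one address on this machine."""
    try:
        return len(resolver(hostname, None)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
```

Calling resolves("kafka1") on your laptop returns False unless you have added the container name to your hosts file or DNS, which is exactly the failure mode described above.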


Gotcha: Log Directories and Filesystem Mount Points

Kafka’s write pattern is sequential appends to log segments. It will saturate a single disk fast. On bare metal, put your log.dirs on a dedicated mount — separate from the OS volume. On Docker, use named volumes backed by local NVMe, not networked storage (NFS, Ceph, EFS). The fsync latency from networked storage will kill your producer throughput and cause timeout cascades.

If you need HA at the storage layer, let Kafka’s replication handle it. That’s what the replication factor is for. Don’t put Kafka data on replicated block devices — you’re paying the I/O cost twice.


Production-Ready: Rack Awareness

When you have brokers spread across physical hosts or availability zones, enable rack awareness so partition replicas don’t all land on the same physical machine:

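# server.properties per broker: set to the AZ or rack label
# (use az2, az3 on the other nodes; keep comments on their own line,
# since a trailing "# ..." would become part of the property value)
broker.rack=az1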

When creating topics, set --replication-factor 3 and Kafka’s replica assignment algorithm will spread each partition’s replicas across racks. Kafka always places a partition’s replicas on different brokers, but without rack labels it cannot tell that three brokers share a host, rack, or AZ, so RF=3 can still lose all three replicas in a single failure.
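To see why rack labels matter, here is a deliberately simplified sketch of rack-aware placement: interleave brokers by rack, then give each partition a run of consecutive brokers from that list. This is not Kafka’s actual assignment algorithm, the function names are invented, and the distinct-rack guarantee only holds when racks have equal broker counts and there are at least as many racks as replicas.

```python
from itertools import zip_longest


def interleave_by_rack(broker_racks: dict[int, str]) -> list[int]:
    """Order brokers so that neighbours in the list sit in different racks."""
    by_rack: dict[str, list[int]] = {}
    for broker, rack in sorted(broker_racks.items()):
        by_rack.setdefault(rack, []).append(broker)
    ordered: list[int] = []
    for group in zip_longest(*(by_rack[rack] for rack in sorted(by_rack))):
        ordered.extend(b for b in group if b is not None)
    return ordered


def assign_replicas(partitions: int, replication_factor: int,
                    broker_racks: dict[int, str]) -> dict[int, list[int]]:
    """Assign each partition a run of consecutive brokers from the interleaved list."""
    ring = interleave_by_rack(broker_racks)
    if replication_factor > len(ring):
        raise ValueError("replication factor exceeds broker count")
    return {
        p: [ring[(p + i) % len(ring)] for i in range(replication_factor)]
        for p in range(partitions)
    }


racks = {1: "az1", 2: "az1", 3: "az2", 4: "az2", 5: "az3", 6: "az3"}
for partition, replicas in assign_replicas(6, 3, racks).items():
    print(partition, replicas, [racks[b] for b in replicas])
```

With six brokers in three zones, every partition ends up with one replica per zone, so losing a whole zone costs each partition at most one replica.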


Graceful Shutdown and Rolling Restarts

Never kill -9 a Kafka broker. A dirty shutdown forces a full log recovery on the next start, which can take minutes on large partitions. Use SIGTERM and let the broker finish in-flight requests and transfer leadership:

docker exec kafka1 /opt/kafka/bin/kafka-server-stop.sh
# The broker shuts down cleanly and the container exits; with restart: unless-stopped,
# Docker then starts it again on its own

For rolling restarts (version upgrades, config changes), always verify ISR is fully caught up before moving to the next node:

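# -e KAFKA_OPTS= keeps the CLI from loading the JMX exporter agent the broker already runs
docker exec -e KAFKA_OPTS= kafka1 /opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server kafka1:29092 \
  --describe \
  --under-replicated-partitions
# Must return nothing before you restart the next node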

Rushing a rolling restart while replicas are catching up is how you turn a maintenance window into an incident.
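The restart-then-wait loop is easy to automate. In the sketch below, restart and under_replicated_count are hypothetical callables you would supply yourself, for example wrappers around the docker commands above or a query against Prometheus. The loop only enforces the ordering: never touch a node while any partition is under-replicated.

```python
import time


def wait_until_in_sync(under_replicated_count, poll_seconds=10.0,
                       timeout_seconds=900.0, sleep=time.sleep, clock=time.monotonic):
    """Block until the cluster reports zero under-replicated partitions."""
    deadline = clock() + timeout_seconds
    while under_replicated_count() > 0:
        if clock() >= deadline:
            raise TimeoutError("replicas did not catch up in time")
        sleep(poll_seconds)


def rolling_restart(nodes, restart, under_replicated_count, **wait_options):
    """Restart nodes one at a time, gating each step on a fully caught-up ISR."""
    for node in nodes:
        wait_until_in_sync(under_replicated_count, **wait_options)
        restart(node)
    # Confirm the last node has caught up before declaring success.
    wait_until_in_sync(under_replicated_count, **wait_options)
```

The timeout matters as much as the gate: a replica that never catches up should stop the rollout and page someone, not leave the script polling forever.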


Where to Go From Here

A running cluster with JMX metrics is the foundation. From here, the natural next steps are:

TLS + SASL authentication: right now this cluster runs plaintext with no auth. Fine for an isolated internal network; unacceptable if clients connect over any shared network. Kafka supports SASL/SCRAM and mTLS.

Schema Registry: if you’re using Avro or Protobuf, run Confluent Schema Registry (source-available under the Confluent Community License, not Apache-licensed) alongside Kafka. It prevents producers and consumers from breaking each other on schema changes.

Kafka UI: Provectus Kafka UI is the best open-source web interface for browsing topics, consumer groups, and configs. Drop it into the same Compose stack.

Tiered storage: Kafka 3.6 introduced tiered storage (offloading older segments to S3/GCS) as early access, and 3.9 marks it production-ready. If your retention requirements are long and disk is expensive, this is worth evaluating.

The KRaft migration path is clear, the tooling is mature, and the operational complexity without ZooKeeper is genuinely lower. The hardest part is the initial sizing — get that wrong and you’re retrofitting disks and heap configs under load. Get it right upfront and this cluster will be one of the quieter pieces of your infrastructure.
