Running Kafka in production without ZooKeeper used to be a fantasy. For years, every Kafka deployment dragged along a ZooKeeper ensemble: three more nodes, three more failure domains, three more things to wake you up at 2 AM. Kafka 3.3 marked KRaft production-ready for new clusters, Kafka 3.5 deprecated ZooKeeper mode, and Kafka 4.0 removed it entirely. If you’re still managing a ZK quorum in 2026, you’re carrying technical debt that will cost you.
This guide walks through a complete self-hosted Kafka setup: KRaft mode from scratch, sizing your cluster based on actual workload math, and wiring JMX metrics into Prometheus so you can see what’s happening before something breaks. No managed service upsell. No hand-waving about "it depends". Real configuration you can adapt and deploy today.
Official GitHub: github.com/apache/kafka
Why KRaft Changes Everything
The old architecture forced a hard coupling between Kafka and ZooKeeper. Metadata (topic configs, partition assignments, ISR state) lived in ZK, meaning every leader election had to touch an external system. This created a ceiling: large clusters with hundreds of thousands of partitions hit ZK coordination limits and became genuinely painful to manage.
KRaft replaces ZooKeeper with a Raft-based metadata quorum built directly into Kafka itself. A subset of brokers (or dedicated controller nodes) participate in the quorum. Metadata is stored in an internal __cluster_metadata topic. The result is faster controller failover (seconds instead of tens of seconds), support for millions of partitions, and one fewer operational dependency.
There’s a mode distinction worth knowing up front: each node runs as broker, controller, or combined (both), selected with process.roles. For small clusters (under 5 nodes), combined mode is fine. For anything production-serious, run dedicated controller nodes. More on this in the sizing section.
Cluster Sizing: The Numbers That Actually Matter
Before touching a config file, answer three questions:
1. What’s your peak throughput?
Kafka is I/O-bound. A single broker on decent hardware (8 cores, NVMe, 10 GbE) can handle 200–400 MB/s sustained write throughput. Replication multiplies the load: with a replication factor of 3, every byte is written to disk on three brokers, and each partition leader also serves two follower fetch streams on top of the producer traffic.
2. What’s your retention window?
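Kafka stores data on disk. retention.ms × sustained write rate × replication factor = your minimum cluster-wide disk requirement; divide by the broker count for the per-node figure (use the peak write rate if you want a conservative bound). Add 30% headroom. Don’t forget that log-compacted topics retain data differently: they don’t expire by time, they keep the latest value per key.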
3. How many partitions do you need?
Partitions are the unit of parallelism. More consumer instances → more partitions needed. Rule of thumb: max(desired_consumer_parallelism, producer_concurrency). Don’t over-partition: each partition replica costs open file handles and memory-mapped index files on the brokers that host it (a rough budget is 1 MB per replica). A cluster with 10,000 partitions is manageable. 500,000 is not unless you’ve done your homework.
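The three answers combine into a back-of-the-envelope estimate. The sketch below is illustrative arithmetic, not a capacity planner: the function name, the 30% headroom default, and the sample workload (10 MB/s sustained, 24 hours of retention, RF 3, 3 brokers) are assumptions made up for the example.

```python
def size_cluster(write_mb_s, retention_hours, replication_factor,
                 brokers, consumer_parallelism, producer_concurrency,
                 headroom=0.30):
    """Rough Kafka sizing from throughput, retention, and parallelism."""
    retention_s = retention_hours * 3600
    # Every byte is stored once per replica for the whole retention window.
    raw_tb = write_mb_s * retention_s * replication_factor / 1_000_000
    total_tb = raw_tb * (1 + headroom)
    return {
        "cluster_disk_tb": round(total_tb, 1),
        "disk_per_broker_tb": round(total_tb / brokers, 1),
        # Cluster-wide disk write rate once replication is included.
        "replicated_write_mb_s": write_mb_s * replication_factor,
        "partitions": max(consumer_parallelism, producer_concurrency),
    }

estimate = size_cluster(write_mb_s=10, retention_hours=24, replication_factor=3,
                        brokers=3, consumer_parallelism=12, producer_concurrency=4)
print(estimate)
```

Even a modest sustained rate adds up quickly once retention and replication are multiplied in, which is why the retention window deserves as much attention as throughput.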
Sizing Templates
| Cluster Size | Workload | Nodes | vCPU/node | RAM/node | Disk/node |
|---|---|---|---|---|---|
| Small | < 50 MB/s, dev/staging | 3 combined | 4 | 8 GB | 500 GB NVMe |
| Medium | 50–200 MB/s, prod | 3 brokers + 3 controllers | 8 | 16 GB | 2 TB NVMe |
| Large | 200 MB/s+, high-throughput | 6+ brokers + 3 controllers | 16 | 32 GB | 4 TB NVMe RAID |
For the medium and large tiers, use dedicated controller nodes. They’re lightweight — a controller quorum handles only metadata, not data path traffic. A 4 vCPU / 8 GB controller node is sufficient.
JVM heap sizing: Set -Xms and -Xmx equal to prevent heap resizing. For brokers, 6 GB heap is a reasonable starting point. Don’t go above 8–10 GB — large heaps cause long GC pauses. Kafka relies heavily on the OS page cache, so leave the rest of RAM for the OS to cache disk I/O.
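To make the heap versus page cache trade-off concrete, here is a small sketch. The 1 GB OS reserve and the 100 MB/s write rate are assumptions for illustration, and the function name is hypothetical.

```python
def page_cache_budget(ram_gb, heap_gb, write_mb_s, os_reserve_gb=1.0):
    """Estimate RAM left for the OS page cache and how many seconds of
    recent writes it can hold (ignores other processes on the host)."""
    cache_gb = ram_gb - heap_gb - os_reserve_gb
    if cache_gb <= 0:
        raise ValueError("heap leaves no room for the page cache")
    seconds_cached = cache_gb * 1000 / write_mb_s
    return cache_gb, seconds_cached

cache_gb, seconds = page_cache_budget(ram_gb=16, heap_gb=6, write_mb_s=100)
print(f"{cache_gb:.0f} GB page cache, about {seconds:.0f} s of recent writes")
```

Consumers that stay within that window are served from memory; consumers that fall further behind start hitting disk, which is one reason an oversized heap can hurt more than it helps.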
Setting Up KRaft: Docker Compose That Actually Works
Let’s build a 3-node combined KRaft cluster with Docker Compose. Each node acts as both broker and controller — suitable for staging or smaller production workloads.
First, generate a cluster UUID. Every KRaft cluster needs one:
docker run --rm apache/kafka:3.9.0 \
/opt/kafka/bin/kafka-storage.sh random-uuid
Save that UUID. You’ll need it for every node’s storage format command and for the CLUSTER_ID env var that the apache/kafka image reads (note: no KAFKA_ prefix on that one).
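If you want to sanity-check an ID before pasting it into three places, the format is simple: 16 random bytes, URL-safe base64, padding stripped, which gives 22 characters. The helper names below are hypothetical, and this is only a sketch of the format; kafka-storage.sh remains the authoritative generator.

```python
import base64
import uuid

def random_cluster_id():
    """Generate an ID shaped like the output of kafka-storage.sh random-uuid."""
    while True:
        raw = uuid.uuid4().bytes
        candidate = base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
        # Skip IDs with a leading dash, which CLI parsers can mistake for a flag.
        if not candidate.startswith("-"):
            return candidate

def is_valid_cluster_id(value):
    """True if the string is 22 chars and decodes to exactly 16 bytes."""
    if len(value) != 22:
        return False
    try:
        return len(base64.urlsafe_b64decode(value + "==")) == 16
    except ValueError:
        return False

generated = random_cluster_id()
print(generated, is_valid_cluster_id(generated))
```

The check only validates the shape. It cannot tell you whether two nodes were formatted with the same ID.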
docker-compose.yml
version: "3.9"
# Shared service settings (image, restart policy, network); per-node settings follow
x-kafka-common: &kafka-common
  image: apache/kafka:3.9.0
  restart: unless-stopped
  networks:
    - kafka-net
networks:
  kafka-net:
    driver: bridge
volumes:
  kafka1-data:
  kafka2-data:
  kafka3-data:
services:
  kafka1:
    <<: *kafka-common
    hostname: kafka1
    container_name: kafka1
    ports:
      - "9092:9092"   # external client listener
      - "9999:9999"   # raw JMX (RMI) port; Prometheus scrapes the exporter port added later
    volumes:
      - kafka1-data:/var/lib/kafka/data
    environment:
      # --- KRaft identity ---
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller   # combined mode
      CLUSTER_ID: "REPLACE_WITH_YOUR_UUID"     # same on all nodes; the image reads CLUSTER_ID, not KAFKA_CLUSTER_ID
      # Set explicitly so log segments land on the mounted volume
      KAFKA_LOG_DIRS: /var/lib/kafka/data
      # --- Listeners ---
      # CONTROLLER: internal Raft traffic
      # PLAINTEXT: internal broker-to-broker replication
      # EXTERNAL: client-facing (mapped to host port 9092)
      KAFKA_LISTENERS: CONTROLLER://0.0.0.0:9093,PLAINTEXT://0.0.0.0:29092,EXTERNAL://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka1:29092,EXTERNAL://YOUR_HOST_IP:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,EXTERNAL:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      # --- KRaft quorum: all 3 nodes participate as controllers ---
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka1:9093,2@kafka2:9093,3@kafka3:9093
      # --- Replication defaults ---
      KAFKA_DEFAULT_REPLICATION_FACTOR: 3
      KAFKA_MIN_INSYNC_REPLICAS: 2
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 3
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 2
      # --- Performance tuning ---
      KAFKA_NUM_PARTITIONS: 6                  # default for auto-created topics
      KAFKA_LOG_RETENTION_HOURS: 168           # 7 days
      KAFKA_LOG_SEGMENT_BYTES: 1073741824      # 1 GB segments
      KAFKA_LOG_RETENTION_CHECK_INTERVAL_MS: 300000
      # --- JVM / GC ---
      KAFKA_HEAP_OPTS: "-Xms4g -Xmx4g"
      # --- JMX ---
      KAFKA_JMX_PORT: 9999
      KAFKA_JMX_HOSTNAME: kafka1               # must match container hostname for remote access
      KAFKA_OPTS: "-Dcom.sun.jndi.rmiregistry.factory.registrySocket.timeout=5000"
  kafka2:
    <<: *kafka-common
    hostname: kafka2
    container_name: kafka2
    ports:
      - "9093:9092"
      - "10000:9999"
    volumes:
      - kafka2-data:/var/lib/kafka/data
    environment:
      KAFKA_NODE_ID: 2
      KAFKA_PROCESS_ROLES: broker,controller
      CLUSTER_ID: "REPLACE_WITH_YOUR_UUID"
      KAFKA_LOG_DIRS: /var/lib/kafka/data
      KAFKA_LISTENERS: CONTROLLER://0.0.0.0:9093,PLAINTEXT://0.0.0.0:29092,EXTERNAL://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka2:29092,EXTERNAL://YOUR_HOST_IP:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,EXTERNAL:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka1:9093,2@kafka2:9093,3@kafka3:9093
      KAFKA_DEFAULT_REPLICATION_FACTOR: 3
      KAFKA_MIN_INSYNC_REPLICAS: 2
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 3
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 2
      KAFKA_NUM_PARTITIONS: 6
      KAFKA_LOG_RETENTION_HOURS: 168
      KAFKA_LOG_SEGMENT_BYTES: 1073741824
      KAFKA_HEAP_OPTS: "-Xms4g -Xmx4g"
      KAFKA_JMX_PORT: 9999
      KAFKA_JMX_HOSTNAME: kafka2
      KAFKA_OPTS: "-Dcom.sun.jndi.rmiregistry.factory.registrySocket.timeout=5000"
  kafka3:
    <<: *kafka-common
    hostname: kafka3
    container_name: kafka3
    ports:
      - "9094:9092"
      - "10001:9999"
    volumes:
      - kafka3-data:/var/lib/kafka/data
    environment:
      KAFKA_NODE_ID: 3
      KAFKA_PROCESS_ROLES: broker,controller
      CLUSTER_ID: "REPLACE_WITH_YOUR_UUID"
      KAFKA_LOG_DIRS: /var/lib/kafka/data
      KAFKA_LISTENERS: CONTROLLER://0.0.0.0:9093,PLAINTEXT://0.0.0.0:29092,EXTERNAL://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka3:29092,EXTERNAL://YOUR_HOST_IP:9094
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,EXTERNAL:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka1:9093,2@kafka2:9093,3@kafka3:9093
      KAFKA_DEFAULT_REPLICATION_FACTOR: 3
      KAFKA_MIN_INSYNC_REPLICAS: 2
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 3
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 2
      KAFKA_NUM_PARTITIONS: 6
      KAFKA_LOG_RETENTION_HOURS: 168
      KAFKA_LOG_SEGMENT_BYTES: 1073741824
      KAFKA_HEAP_OPTS: "-Xms4g -Xmx4g"
      KAFKA_JMX_PORT: 9999
      KAFKA_JMX_HOSTNAME: kafka3
      KAFKA_OPTS: "-Dcom.sun.jndi.rmiregistry.factory.registrySocket.timeout=5000"
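Most of the three service definitions is repetition. As a sketch of which environment values actually vary per node, the hypothetical helper below derives them from the node ID. It assumes the kafkaN hostnames and the 9093 / 29092 / 9092+ port scheme used in this Compose file; the IP address in the example is a documentation placeholder.

```python
def node_env(node_id, node_count, host_ip, base_port=9092):
    """Build the per-node environment values for a combined-mode node."""
    voters = ",".join(f"{n}@kafka{n}:9093" for n in range(1, node_count + 1))
    # Host-side client port: 9092 for node 1, 9093 for node 2, and so on.
    external_port = base_port + node_id - 1
    return {
        "KAFKA_NODE_ID": str(node_id),
        "KAFKA_ADVERTISED_LISTENERS":
            f"PLAINTEXT://kafka{node_id}:29092,EXTERNAL://{host_ip}:{external_port}",
        "KAFKA_CONTROLLER_QUORUM_VOTERS": voters,
        "KAFKA_JMX_HOSTNAME": f"kafka{node_id}",
    }

env2 = node_env(node_id=2, node_count=3, host_ip="192.0.2.10")
for key, value in env2.items():
    print(f"{key}={value}")
```

Everything else (replication defaults, retention, heap) is identical across nodes, so a typo in one of these four values is the usual reason a node fails to join the quorum.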
Start it:
docker compose up -d
# Verify all nodes see each other (expect one "(id: N ...)" line per broker)
docker exec kafka1 /opt/kafka/bin/kafka-broker-api-versions.sh \
  --bootstrap-server kafka1:29092 | grep 'id:'
Gotcha: Storage Format Must Be Done Before First Boot
KRaft requires formatting the storage directory with your cluster ID before the broker can start. The official Docker image handles this automatically using the CLUSTER_ID environment variable. But if you’re running bare-metal Kafka from a tarball, you must run this manually on every node:
./bin/kafka-storage.sh format \
--config config/kraft/server.properties \
--cluster-id "YOUR_UUID_HERE"
Skip this step and Kafka refuses to start. No helpful error message, just a cryptic log about a missing meta.properties. Been there.
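A quick pre-flight check saves the guesswork. The sketch below looks for meta.properties in a log directory and compares its cluster.id with the one you expect. The function names and the EXAMPLE_ID value are hypothetical, and the self-check at the end writes a made-up file into a temporary directory rather than touching a real broker.

```python
import tempfile
from pathlib import Path

def read_meta_properties(log_dir):
    """Parse <log_dir>/meta.properties, or return None if it is missing."""
    meta = Path(log_dir) / "meta.properties"
    if not meta.is_file():
        return None
    props = {}
    for line in meta.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props

def check_formatted(log_dir, expected_cluster_id):
    props = read_meta_properties(log_dir)
    if props is None:
        return "not formatted: run kafka-storage.sh format first"
    if props.get("cluster.id") != expected_cluster_id:
        return f"cluster.id mismatch: found {props.get('cluster.id')}"
    return "ok"

# Self-check against a throwaway directory with a made-up file.
with tempfile.TemporaryDirectory() as tmp:
    before = check_formatted(tmp, "EXAMPLE_ID")
    Path(tmp, "meta.properties").write_text("version=1\ncluster.id=EXAMPLE_ID\nnode.id=1\n")
    after = check_formatted(tmp, "EXAMPLE_ID")
print(before)
print(after)
```

On a real node you would point check_formatted at each directory listed in log.dirs before starting the broker.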
JMX Observability: From Raw Metrics to Prometheus
JMX is Kafka’s native metrics system, and it’s both powerful and annoying. Powerful because every critical broker metric is there: throughput, request latencies, ISR shrinks, log end offsets. Annoying because the default JMX interface requires Java tooling to consume. Nobody wants jconsole.
The standard approach: run jmx_exporter as a Java agent inside the Kafka JVM. It reads the JMX MBeans and exposes them on an HTTP port in Prometheus format. One config file, one extra JVM flag.
JMX Exporter Config
Save this as jmx-exporter/kafka-config.yml:
# jmx_exporter config for Apache Kafka 3.x KRaft
# Collects the metrics that actually matter in production
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # ---- Broker throughput (bytes in/out per second) ----
  - pattern: "kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec|MessagesInPerSec), topic=(.+)><>Count"
    name: kafka_server_brokertopicmetrics_$1_total
    labels:
      topic: "$2"
    type: COUNTER
  # ---- Request handler idle ratio: below 20% means broker is CPU-stressed ----
  - pattern: "kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate"
    name: kafka_server_request_handler_avg_idle_percent
    type: GAUGE
  # ---- Under-replicated partitions: should ALWAYS be 0 in steady state ----
  - pattern: "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value"
    name: kafka_server_replica_manager_under_replicated_partitions
    type: GAUGE
  # ---- ISR shrinks/expands: frequent shrinks indicate network or GC pressure ----
  - pattern: "kafka.server<type=ReplicaManager, name=(IsrShrinksPerSec|IsrExpandsPerSec)><>OneMinuteRate"
    name: kafka_server_replica_manager_$1
    type: GAUGE
  # ---- Active controller count: exactly 1 is healthy, 0 means election in progress ----
  - pattern: "kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value"
    name: kafka_controller_active_controller_count
    type: GAUGE
  # ---- Log end offsets per partition: useful for lag calculation ----
  - pattern: "kafka.log<type=Log, name=LogEndOffset, topic=(.+), partition=(.+)><>Value"
    name: kafka_log_log_end_offset
    labels:
      topic: "$1"
      partition: "$2"
    type: GAUGE
  # ---- Network processor idle ratio: below 30% is a warning sign ----
  - pattern: "kafka.network<type=SocketServer, name=NetworkProcessorAvgIdlePercent><>Value"
    name: kafka_network_processor_avg_idle_percent
    type: GAUGE
  # ---- Purgatory size (delayed produce/fetch operations waiting) ----
  - pattern: "kafka.server<type=DelayedOperationPurgatory, name=PurgatorySize, delayedOperation=(.+)><>Value"
    name: kafka_server_delayed_operation_purgatory_size
    labels:
      delayed_operation: "$1"
    type: GAUGE
  # ---- JVM GC ----
  - pattern: "java.lang<type=GarbageCollector, name=(.+)><>(CollectionCount|CollectionTime)"
    name: jvm_gc_$2_total
    labels:
      gc: "$1"
    type: COUNTER
  # ---- JVM memory ----
  - pattern: "java.lang<type=Memory><HeapMemoryUsage>used"
    name: jvm_heap_memory_used_bytes
    type: GAUGE
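It helps to know what metric name a rule will actually emit, because lowercaseOutputName: true lowercases the result and your PromQL has to match it. The sketch below approximates the substitution for a single rule. It is a simplification written for illustration (the real exporter matches against a longer string that includes the attribute value and also sanitizes names), and apply_rule is a hypothetical helper, not part of jmx_exporter.

```python
import re

def apply_rule(pattern, name_template, mbean, lowercase=True):
    """Return the metric name a rule would produce, or None if no match."""
    match = re.search(pattern, mbean)
    if match is None:
        return None
    # Substitute $1, $2, ... with the corresponding capture groups.
    name = re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), name_template)
    return name.lower() if lowercase else name

mbean = "kafka.server<type=ReplicaManager, name=IsrShrinksPerSec><>OneMinuteRate"
rule = r"kafka.server<type=ReplicaManager, name=(IsrShrinksPerSec|IsrExpandsPerSec)><>OneMinuteRate"
metric = apply_rule(rule, "kafka_server_replica_manager_$1", mbean)
unmatched = apply_rule(rule, "kafka_server_replica_manager_$1", "kafka.log<type=Log><>Value")
print(metric)
```

The captured IsrShrinksPerSec comes out as isrshrinkspersec, so any alert or dashboard query must use the all-lowercase form.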
Wiring JMX Exporter into the Container
Download the agent JAR:
mkdir -p jmx-exporter
curl -L https://github.com/prometheus/jmx_exporter/releases/download/1.1.0/jmx_prometheus_javaagent-1.1.0.jar \
-o jmx-exporter/jmx_prometheus_javaagent.jar
Add to your Docker Compose volumes and environment:
# Add to each kafka service:
    volumes:
      - ./jmx-exporter:/opt/jmx-exporter:ro
      # ... existing volumes
    environment:
      # Replace the KAFKA_OPTS line with:
      KAFKA_OPTS: >-
        -javaagent:/opt/jmx-exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx-exporter/kafka-config.yml
        -Dcom.sun.jndi.rmiregistry.factory.registrySocket.timeout=5000
    ports:
      - "7071:7071"   # Prometheus scrape port; change only the host side per node (7071, 7072, 7073)
After restarting, verify:
curl -s http://localhost:7071/metrics | grep kafka_server_replica_manager_under_replicated
# Should return: kafka_server_replica_manager_under_replicated_partitions 0.0
Zero under-replicated partitions. That’s your baseline. Alert on anything above zero.
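The same check is easy to script for a smoke test. The sketch below parses Prometheus text exposition output and pulls out the under-replicated gauge. It is a minimal parser that assumes no trailing timestamps, and the sample text is made up for the example rather than captured from a real broker.

```python
def parse_samples(text):
    """Parse exposition-format lines into {series: float}, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples

def under_replicated(text):
    metric = "kafka_server_replica_manager_under_replicated_partitions"
    return parse_samples(text).get(metric)

sample = """\
# TYPE kafka_server_replica_manager_under_replicated_partitions gauge
kafka_server_replica_manager_under_replicated_partitions 0.0
kafka_network_processor_avg_idle_percent 0.97
"""
value = under_replicated(sample)
print(value)
```

Against a live node you would feed it the body of the /metrics response, for example via urllib.request.urlopen, and fail the smoke test on any non-zero value.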
Prometheus Scrape Config
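# prometheus.yml scrape section
scrape_configs:
  - job_name: "kafka"
    static_configs:
      - targets:
          # Inside the Compose network every node listens on container port 7071.
          # From outside Docker, use YOUR_HOST_IP with the mapped host ports instead.
          - "kafka1:7071"
          - "kafka2:7071"
          - "kafka3:7071"
        labels:
          cluster: "prod-kafka"
    metrics_path: /metrics
    scrape_interval: 15s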
The Five Alerts You Must Have
Set these up before anything goes to production. If you skip alerts and only look at dashboards, you’ll find out about problems from your users.
1. Under-replicated partitions
- alert: KafkaUnderReplicatedPartitions
  expr: kafka_server_replica_manager_under_replicated_partitions > 0
  for: 2m
  labels:
    severity: critical
2. No active controller
- alert: KafkaNoActiveController
  expr: sum(kafka_controller_active_controller_count) != 1
  for: 1m
  labels:
    severity: critical
3. Request handler saturation
- alert: KafkaRequestHandlerSaturated
  expr: kafka_server_request_handler_avg_idle_percent < 0.2
  for: 5m
  labels:
    severity: warning
4. ISR shrinks spiking
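# The exported gauge is already a one-minute rate and its name is lowercased
# by lowercaseOutputName, so compare it directly instead of wrapping it in rate().
- alert: KafkaIsrShrinkRate
  expr: kafka_server_replica_manager_isrshrinkspersec > 0.1
  for: 5m
  labels:
    severity: warning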
5. Consumer group lag (use KMinion or kafka-consumer-groups.sh; broker JMX doesn’t expose consumer lag, which is a common point of confusion)
For consumer lag, add KMinion to your stack: it reads consumer group offsets via the Kafka API and exposes them for Prometheus. Kafka itself doesn’t ship a Prometheus exporter for consumer group lag, so this has to be a separate component.
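Whatever tool you pick, the number it reports is the same subtraction: log end offset minus committed offset, per partition. A minimal sketch, with a hypothetical topic name and made-up offsets:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag. Both arguments map (topic, partition) -> offset.
    Partitions without a committed offset are reported as None, not guessed."""
    lag = {}
    for tp, end in log_end_offsets.items():
        committed = committed_offsets.get(tp)
        lag[tp] = None if committed is None else max(end - committed, 0)
    return lag

# Hypothetical offsets for a topic named "orders"
ends = {("orders", 0): 1500, ("orders", 1): 980, ("orders", 2): 40}
committed = {("orders", 0): 1450, ("orders", 1): 980}
lag = consumer_lag(ends, committed)
total = sum(v for v in lag.values() if v is not None)
print(lag, total)
```

Alert on the trend rather than the absolute value: a lag that keeps growing means consumers cannot keep up, while a steady non-zero lag is often just batching.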
Gotcha: KAFKA_JMX_HOSTNAME Must Be Resolvable
Remote JMX (the raw RMI protocol, not the exporter) breaks silently if KAFKA_JMX_HOSTNAME doesn’t resolve from the client machine. With Docker, set it to the container hostname, not localhost. The JMX agent communicates the hostname back to the client during the handshake — a mismatch causes a connection timeout that looks like a firewall issue but isn’t.
If you’re exposing raw JMX outside Docker for tools like JConsole, remember that Kafka’s default JMX options already disable authentication and SSL (-Dcom.sun.management.jmxremote.authenticate=false, -Dcom.sun.management.jmxremote.ssl=false) unless you override KAFKA_JMX_OPTS. Only publish that port on isolated networks. Raw JMX with no auth is a security hole.
Gotcha: Log Directories and Filesystem Mount Points
Kafka’s write pattern is sequential appends to log segments. It will saturate a single disk fast. On bare metal, put your log.dirs on a dedicated mount — separate from the OS volume. On Docker, use named volumes backed by local NVMe, not networked storage (NFS, Ceph, EFS). The fsync latency from networked storage will kill your producer throughput and cause timeout cascades.
If you need HA at the storage layer, let Kafka’s replication handle it. That’s what the replication factor is for. Don’t put Kafka data on replicated block devices — you’re paying the I/O cost twice.
Production-Ready: Rack Awareness
When you have brokers spread across physical hosts or availability zones, enable rack awareness so partition replicas don’t all land on the same physical machine:
# server.properties per broker: set to the AZ or rack label (az2, az3 on the other nodes).
# Keep comments on their own line: a trailing "# ..." would become part of the value.
broker.rack=az1
When creating topics, set --replication-factor 3 and Kafka’s replica assignment algorithm will spread replicas across racks. Without this, you can have RF=3 and still lose all three replicas in a single rack or AZ failure if the three brokers holding them happen to share it.
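You can audit an existing cluster for this exposure with a few lines. The sketch below flags partitions whose replicas all sit in one rack. The broker-to-rack map and the assignments are hypothetical inputs you would build from kafka-topics.sh --describe output and your broker.rack settings.

```python
def single_rack_partitions(assignments, broker_racks):
    """Return partitions whose replicas all live in the same rack.
    assignments: {(topic, partition): [broker_id, ...]}
    broker_racks: {broker_id: rack_label}"""
    exposed = []
    for tp, replicas in sorted(assignments.items()):
        racks = {broker_racks[b] for b in replicas}
        if len(racks) == 1:
            exposed.append(tp)
    return exposed

# Hypothetical six-broker cluster, two brokers per AZ
racks = {1: "az1", 2: "az1", 3: "az2", 4: "az2", 5: "az3", 6: "az3"}
assignments = {
    ("orders", 0): [1, 3, 5],   # one replica per AZ
    ("orders", 1): [1, 2, 3],   # spans two AZs
    ("orders", 2): [1, 2],      # both replicas in az1
}
exposed = single_rack_partitions(assignments, racks)
print(exposed)
```

Anything this returns needs a partition reassignment, since setting broker.rack only affects how new replicas are placed.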
Graceful Shutdown and Rolling Restarts
Never kill -9 a Kafka broker. A dirty shutdown forces a full log recovery on the next start, which can take minutes on large partitions. Use SIGTERM and let the broker finish in-flight requests and transfer leadership:
docker exec kafka1 /opt/kafka/bin/kafka-server-stop.sh
# Wait for container to exit, then restart
For rolling restarts (version upgrades, config changes), always verify ISR is fully caught up before moving to the next node:
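# -e KAFKA_OPTS= stops the CLI tool from loading the jmx_exporter agent,
# which would fail to bind port 7071 because the broker already holds it
docker exec -e KAFKA_OPTS= kafka1 /opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server kafka1:29092 \
  --describe \
  --under-replicated-partitions
# Must return nothing before you restart the next node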
Rushing a rolling restart while replicas are catching up is how you turn a maintenance window into an incident.
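The procedure is mechanical enough to script. Here is a sketch of the control loop, with the Docker and Kafka calls left as caller-supplied functions so the logic can be dry-run. The function name, the poll interval, and the timeout are assumptions; in practice restart would wrap your stop/start commands and under_replicated would count the lines printed by the kafka-topics.sh check.

```python
import time

def rolling_restart(nodes, restart, under_replicated, poll_s=5.0, timeout_s=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Restart nodes one at a time, waiting for zero under-replicated
    partitions before moving on to the next node."""
    for node in nodes:
        restart(node)
        deadline = clock() + timeout_s
        while under_replicated() > 0:
            if clock() >= deadline:
                raise TimeoutError(f"replicas still catching up after restarting {node}")
            sleep(poll_s)
    return list(nodes)

# Dry run with stand-ins: no Docker involved, counts are made up.
restarted = []
fake_counts = iter([2, 1, 0, 0, 3, 0])
rolling_restart(["kafka1", "kafka2", "kafka3"], restarted.append,
                lambda: next(fake_counts), sleep=lambda _seconds: None)
print(restarted)
```

The timeout matters as much as the loop: a node that never catches up should stop the rollout, not let it continue onto the next broker.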
Where to Go From Here
A running cluster with JMX metrics is the foundation. From here, the natural next steps are:
TLS + SASL authentication: right now this cluster runs plaintext with no auth. Fine for an isolated internal network; unacceptable if clients connect over any shared network. Kafka supports SASL/SCRAM and mTLS.
Schema Registry: if you’re using Avro or Protobuf, run Confluent Schema Registry alongside Kafka. Note that it is source-available under the Confluent Community License, not Apache-licensed. It prevents producers and consumers from breaking each other on schema changes.
Kafka UI: Provectus Kafka UI is the best open-source web interface for browsing topics, consumer groups, and configs. Drop it into the same Compose stack.
Tiered storage: Kafka has offered tiered storage (offloading older segments to S3/GCS) as early access since 3.6 and as production-ready since 3.9. Kafka ships the framework; the object-store integration comes from a RemoteStorageManager plugin you supply. If your retention requirements are long and disk is expensive, this is worth evaluating.
The KRaft migration path is clear, the tooling is mature, and the operational complexity without ZooKeeper is genuinely lower. The hardest part is the initial sizing — get that wrong and you’re retrofitting disks and heap configs under load. Get it right upfront and this cluster will be one of the quieter pieces of your infrastructure.