From CrashLoops to Clarity: A Kafka Story with Punchlines and Pods
The Timeline (Kafka Chronicles)
- Day 1:
  - Installed Lens, deployed the initial YAMLs, and faced mysterious pod crash loops with no logs; panic set in.
  - Started brute-force debugging with kubectl, then switched to manual container testing via ctr.
- Day 2:
  - Narrowed the issue down to Bitnami image defaults, fought KRaft mode, and pinned the image version.
  - Built dynamic environment variable injection for broker.id, node.id, and the advertised listeners.
- Day 3:
  - Introduced authentication via secrets and tested producers/consumers.
  - Added the REST proxy and tested it against the cluster ID and REST API.
  - Finalized all YAML files and ran a celebratory kafka-topics.sh --create like champions.
Phase 0: The Glorious (and Naive) Start
Goal: Spin up a 3-node Apache Kafka cluster on Kubernetes (managed through Lens), Zookeeper-based (no KRaft), with no PVCs, exposed via Services and a REST proxy.
Initial Stack:
- Bitnami Kafka Docker image
- Bitnami Zookeeper image
- Kubernetes Lens
- REST Proxy image (confluentinc)
- Secrets via Kubernetes
Expectation: Easy.
Reality:
"Back-off restarting failed container kafka"
That one line haunted us for hours, with zero log output to go on. That single error message forced us to roll up our sleeves, dive into the container runtime, and take control like mad engineers on a runaway train.
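Day 1's brute-force kubectl pass looked roughly like this (namespace and pod name match the manifests later in this post; the exact sequence is a reconstruction):

# Check pod status and the events behind the back-off
kubectl -n kafka-server get pods
kubectl -n kafka-server describe pod kafka-0
# Try logs from both the current and the previously crashed container
kubectl -n kafka-server logs kafka-0
kubectl -n kafka-server logs kafka-0 --previous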
Phase 1: Let the Logs Speak (and Yell)
Kafka wasn't happy. Errors included:
- Kafka requires at least one process role to be set
- KRaft mode requires a unique node.id
- KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP: unbound variable
Root Causes:
- Bitnami defaults to KRaft, even if we want Zookeeper.
- KAFKA_ENABLE_KRAFT=no was ignored in early attempts.
- We forgot KAFKA_CFG_NODE_ID and KAFKA_CFG_PROCESS_ROLES.
We started building dynamic broker/node IDs using this trick:
ordinal=${HOSTNAME##*-}
export KAFKA_CFG_NODE_ID=$ordinal
export KAFKA_BROKER_ID=$ordinal
Phase 2: Debugging the Bitnami Beast
We went full bare-metal, using ctr to run Kafka containers manually with all debug flags:
ctr --debug run --rm -t \
--hostname kafka-test-0 \
--label name=kafka-test \
--env BITNAMI_DEBUG=true \
--env KAFKA_LOG4J_ROOT_LOGLEVEL=DEBUG \
--env KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181 \
--env KAFKA_BROKER_ID=0 \
--env ALLOW_PLAINTEXT_LISTENER=yes \
--env KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT \
--env KAFKA_CFG_PROCESS_ROLES=broker \
--env KAFKA_ENABLE_KRAFT=false \
--env KAFKA_CLUSTER_ID=kafka-test-cluster-1234 \
--env KAFKA_CFG_NODE_ID=0 \
docker.io/bitnami/kafka:latest kafka-test
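One prerequisite worth noting: ctr only sees images that already exist in the containerd namespace you point it at, so the image may need to be pulled first (a sketch; add -n k8s.io if you want to reuse images pulled by the kubelet):

# Pull the Bitnami image into containerd and confirm it is there
ctr images pull docker.io/bitnami/kafka:latest
ctr images ls | grep bitnami/kafka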
Based on the errors this produced, we realized that no matter what we did, KRaft was always enabled, so we needed to drop back to an earlier bitnami/kafka image. The Helm chart docs list, just below the parameter flags, which Kafka version each image ships, and the one we needed was kafka:3.9.0-debian-12-r13.
✅ Switching to Kafka 3.9.0 fixed the issue: Kafka 4.0 enforces KRaft, which breaks Zookeeper compatibility.
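For reference, the same version mapping can be pulled from the chart index itself; a sketch, assuming the Bitnami Helm repo is configured locally:

# Chart versions alongside the Kafka (app) version each one ships
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm search repo bitnami/kafka --versions | head -20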
Phase 3: Securing It with Secrets (and Sanity)
We created a Kubernetes Secret with the following mapping:
| User | Password |
|---|---|
| producer | producer@kafkaserver |
| consumer | consumer@kafkaserver |
apiVersion: v1
kind: Secret
metadata:
  name: kafka-auth-secret
  namespace: kafka-server
  labels:
    app: kafka
    component: auth
type: Opaque
data:
  KAFKA_CLIENT_USERS: cHJvZHVjZXIsY29uc3VtZXI=
  KAFKA_CLIENT_PASSWORDS: cHJvZHVjZXJAa2Fma2FzZXJ2ZXIsY29uc3VtZXJAa2Fma2FzZXJ2ZXI=
Then we injected these into the StatefulSet via env.valueFrom.secretKeyRef.
✅ To generate the data values for KAFKA_CLIENT_USERS and KAFKA_CLIENT_PASSWORDS, base64-encode the comma-separated lists (the first command below produces the passwords value, the second the users value):
echo -n 'producer@kafkaserver,consumer@kafkaserver' | base64
echo -n 'producer,consumer' | base64
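Alternatively, kubectl can build the same Secret and handle the base64 encoding for you; this sketch prints an equivalent manifest (labels omitted) without applying it:

kubectl -n kafka-server create secret generic kafka-auth-secret \
  --from-literal=KAFKA_CLIENT_USERS='producer,consumer' \
  --from-literal=KAFKA_CLIENT_PASSWORDS='producer@kafkaserver,consumer@kafkaserver' \
  --dry-run=client -o yaml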
Phase 4: Testing Inside the Pod
To validate authentication, we ran:
Producer
kafka-console-producer.sh \
--broker-list kafka-0.kafka.kafka-server.svc.cluster.local:9092 \
--topic auth-test-topic \
--producer.config <(echo "...with producer JAAS config...")
Consumer
kafka-console-consumer.sh \
--bootstrap-server kafka-0.kafka.kafka-server.svc.cluster.local:9092 \
--topic auth-test-topic \
--from-beginning \
--consumer.config <(echo "...with consumer JAAS config...")
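The client configs themselves are elided above. Purely as an illustration (this assumes the brokers expose a SASL/PLAIN listener, which is what the Bitnami KAFKA_CLIENT_USERS/KAFKA_CLIENT_PASSWORDS variables are meant to drive; the PLAINTEXT-only YAML further down does not need it), a producer config written to a file instead of process substitution could look roughly like this:

# Hypothetical client settings; match security.protocol to your actual listener
cat > /tmp/producer.properties <<'EOF'
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="producer" \
  password="producer@kafkaserver";
EOF

kafka-console-producer.sh \
  --broker-list kafka-0.kafka.kafka-server.svc.cluster.local:9092 \
  --topic auth-test-topic \
  --producer.config /tmp/producer.properties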
Phase 5: REST Proxy Integration
We deployed Kafka REST Proxy and queried the cluster ID:
curl http://kafka-rest-proxy.kafka-server.svc.cluster.local:8082/v3/clusters
Then we created a test topic:
curl -X POST -H "Content-Type: application/vnd.kafka.v3+json" \
--data '{"topic_name": "test-topic", "partitions": 1, "replication_factor": 1}' \
http://kafka-rest-proxy.kafka-server.svc.cluster.local:8082/v3/clusters/<cluster-id>/topics
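To avoid copy-pasting the cluster ID by hand, it can be extracted from the /v3/clusters response (assuming jq is available and the usual v3 response shape, where each entry in data carries a cluster_id):

CLUSTER_ID=$(curl -s http://kafka-rest-proxy.kafka-server.svc.cluster.local:8082/v3/clusters \
  | jq -r '.data[0].cluster_id')
echo "$CLUSTER_ID"   # substitute into the /topics URL above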
✅ It worked. Eventually.
Technical Findings
- Bitnami Kafka 4.0+ forces KRaft; use 3.9.0 for Zookeeper.
- Set KAFKA_ENABLE_KRAFT=no, but also avoid any KRaft-only vars.
- Use the pod ordinal to dynamically assign broker.id and node.id.
- Zookeeper must be deployed before Kafka.
- The REST Proxy works fine once the cluster ID is fetched and the correct Content-Type is used.
- A "Back-off restarting failed container" error with no logs often means missing required env vars.
✅ Final Working YAMLs
Kafka StatefulSet (3 Nodes)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: kafka-server
spec:
  serviceName: kafka
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: kafka
          image: bitnami/kafka:3.9.0-debian-12-r13
          ports:
            - containerPort: 9092
          env:
            - name: KAFKA_CFG_ZOOKEEPER_CONNECT
              value: "zookeeper.kafka-server.svc.cluster.local:2181"
            - name: KAFKA_CFG_LISTENERS
              value: "PLAINTEXT://:9092"
            - name: KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP
              value: "PLAINTEXT:PLAINTEXT"
            - name: KAFKA_CFG_INTER_BROKER_LISTENER_NAME
              value: "PLAINTEXT"
            - name: KAFKA_ENABLE_KRAFT
              value: "no"
            - name: ALLOW_PLAINTEXT_LISTENER
              value: "yes"
            - name: KAFKA_CLIENT_USERS
              valueFrom:
                secretKeyRef:
                  name: kafka-auth-secret
                  key: KAFKA_CLIENT_USERS
            - name: KAFKA_CLIENT_PASSWORDS
              valueFrom:
                secretKeyRef:
                  name: kafka-auth-secret
                  key: KAFKA_CLIENT_PASSWORDS
            - name: BITNAMI_DEBUG
              value: "true"
            - name: KAFKA_LOG4J_ROOT_LOGLEVEL
              value: "DEBUG"
          command:
            - bash
            - -c
            - |
              ordinal=${HOSTNAME##*-}
              export KAFKA_CFG_NODE_ID=$ordinal
              export KAFKA_BROKER_ID=$ordinal
              export KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://${HOSTNAME}.kafka.kafka-server.svc.cluster.local:9092
              exec /opt/bitnami/scripts/kafka/entrypoint.sh /opt/bitnami/scripts/kafka/run.sh
✅ For production, make sure to remove the DEBUG-related configuration (BITNAMI_DEBUG and KAFKA_LOG4J_ROOT_LOGLEVEL).
Zookeeper StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper
  namespace: kafka-server
spec:
  serviceName: zookeeper
  replicas: 1
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: zookeeper
          image: bitnami/zookeeper:latest
          ports:
            - containerPort: 2181
              name: client
          env:
            - name: ALLOW_ANONYMOUS_LOGIN
              value: "yes"
REST Proxy Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-rest-proxy
  namespace: kafka-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka-rest-proxy
  template:
    metadata:
      labels:
        app: kafka-rest-proxy
    spec:
      containers:
        - name: rest-proxy
          image: confluentinc/cp-kafka-rest:7.4.0
          ports:
            - containerPort: 8082
          env:
            - name: KAFKA_REST_BOOTSTRAP_SERVERS
              value: PLAINTEXT://kafka.kafka-server.svc.cluster.local:9092
            - name: KAFKA_REST_HOST_NAME
              value: kafka-rest-proxy
Services
---
apiVersion: v1
kind: Service
metadata:
  name: kafka
  namespace: kafka-server
spec:
  clusterIP: None
  selector:
    app: kafka
  ports:
    - name: kafka
      port: 9092
      targetPort: 9092
---
apiVersion: v1
kind: Service
metadata:
  name: zookeeper
  namespace: kafka-server
spec:
  clusterIP: None
  selector:
    app: zookeeper
  ports:
    - port: 2181
      targetPort: 2181
      name: client
---
apiVersion: v1
kind: Service
metadata:
  name: kafka-rest-proxy
  namespace: kafka-server
spec:
  type: ClusterIP
  selector:
    app: kafka-rest-proxy
  ports:
    - port: 8082
      targetPort: 8082
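To stand the whole thing up from scratch and repeat the Day 3 celebration, the manifests can be applied in dependency order and smoke-tested with kafka-topics.sh (file and topic names below are illustrative):

# Namespace and Secret first, then Zookeeper, Kafka, REST proxy, and Services
kubectl create namespace kafka-server   # skip if it already exists
kubectl apply -f kafka-auth-secret.yaml -f zookeeper-statefulset.yaml
kubectl apply -f kafka-statefulset.yaml -f kafka-rest-proxy.yaml -f services.yaml
kubectl -n kafka-server get pods -w

# Once everything is Running, create and list a topic from inside a broker pod
kubectl -n kafka-server exec -it kafka-0 -- kafka-topics.sh --create \
  --topic crashloops-to-clarity --partitions 3 --replication-factor 3 \
  --bootstrap-server kafka-0.kafka.kafka-server.svc.cluster.local:9092
kubectl -n kafka-server exec -it kafka-0 -- kafka-topics.sh --list \
  --bootstrap-server kafka-0.kafka.kafka-server.svc.cluster.local:9092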