What Is Apache Kafka? A Plain English Explanation for Data Engineers
Apache Kafka appears in almost every senior data engineering job description. Most beginners think of it as a message queue, but that undersells what it actually is and why it changed how companies build data pipelines.
Kafka in one sentence
Kafka is a distributed, durable, high-throughput event streaming platform. Producers write events to Kafka topics. Consumers read from those topics at their own pace. Events are retained for a configurable period rather than deleted after consumption, so multiple systems can read the same events independently.
Why Kafka instead of a database or message queue
A database stores current state. Kafka stores the history of events that created that state. This is the fundamental difference.
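A traditional message queue (RabbitMQ, SQS) hands each message to a consumer and deletes it once it is acknowledged. Kafka retains messages for a configurable retention period (seven days by default, commonly set to days or weeks) whether or not anyone has read them. Multiple consumers read the same message independently without affecting each other.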
This means a single stream of user clickstream events can simultaneously feed a real-time dashboard, a fraud detection model, and a batch analytics pipeline, all reading from the same Kafka topic at different speeds.
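To make the "same events, different speeds" idea concrete, here is a minimal in-memory sketch. It is not a Kafka client: TopicLog and Reader are hypothetical names invented for this illustration, and the log stands in for a single-partition topic. The point it shows is that reading never removes anything, and each reader keeps only its own position.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Toy stand-in for a single-partition topic (hypothetical, not a Kafka API)."""
    events: list = field(default_factory=list)

    def append(self, event):
        """Append an event and return its position in the log."""
        self.events.append(event)
        return len(self.events) - 1

    def read(self, position, max_events):
        """Return up to max_events starting at position; nothing is deleted."""
        return self.events[position:position + max_events]


@dataclass
class Reader:
    """A consumer that remembers only its own position in the log."""
    name: str
    position: int = 0
    seen: list = field(default_factory=list)

    def poll(self, log, max_events):
        batch = log.read(self.position, max_events)
        self.seen.extend(batch)
        self.position += len(batch)
        return batch


clicks = TopicLog()
for page in ["/home", "/cart", "/checkout", "/home"]:
    clicks.append({"page": page})

dashboard = Reader("dashboard")
fraud_model = Reader("fraud_model")

dashboard.poll(clicks, max_events=4)    # fast reader: takes everything available
fraud_model.poll(clicks, max_events=1)  # slow reader: one event per poll
fraud_model.poll(clicks, max_events=1)
```

After this runs, the dashboard reader is at position 4 and the fraud model reader is at position 2, while the log still holds all four events. Neither reader's progress affects the other.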
Core concepts
Topic: a named, append-only log of events. Like a table, but rows are only ever appended. Ordering is guaranteed within each partition, not across the whole topic.
Partition: topics are split into partitions for parallelism. A topic with 12 partitions supports up to 12 consumers in the same consumer group reading in parallel.
Consumer group: a group of consumers that cooperate to read a topic. Each partition is assigned to exactly one consumer in the group. Adding consumers increases throughput until there is one consumer per partition; any consumers beyond that sit idle.
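Offset: each message has a sequential position number in its partition. Consumers track their progress by committing offsets, so a restarted consumer resumes from its last committed offset. On its own this gives at-least-once processing, because messages handled after the last commit are handled again on restart. Exactly-once semantics additionally require idempotent producers and transactions.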
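The concepts above can be sketched in a few lines of plain Python. This is a simulation under stated assumptions, not Kafka itself. The function names partition_for, assign, and consume are hypothetical. CRC32 is swapped in for the murmur2 hash that Kafka's default partitioner applies to keys. Round-robin assignment is only one of several strategies a real consumer group can use.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a partition from a key. Kafka's default partitioner hashes keys
    with murmur2; CRC32 is swapped in here because it ships with Python."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(num_partitions, consumers):
    """Spread partitions round-robin so each has exactly one owner in the group."""
    owners = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owners[consumers[partition % len(consumers)]].append(partition)
    return owners


def consume(log, committed, partition, handled, crash_before_commit_at=None):
    """Process records from the committed offset, committing after each one."""
    offset = committed.get(partition, 0)
    while offset < len(log):
        handled.append(log[offset])        # the side effect happens first
        if offset == crash_before_commit_at:
            raise RuntimeError("simulated crash before the offset commit")
        offset += 1
        committed[partition] = offset      # next offset to read after a restart


# Same key, same partition: one user's events stay in order.
topic = defaultdict(list)
for user, action in [("alice", "login"), ("bob", "login"),
                     ("alice", "click"), ("alice", "logout")]:
    topic[partition_for(user, 4)].append((user, action))
alice_actions = [a for u, a in topic[partition_for("alice", 4)] if u == "alice"]

# 12 partitions over 3 consumers gives 4 each; with 2 partitions, c3 idles.
balanced = assign(12, ["c1", "c2", "c3"])
oversized = assign(2, ["c1", "c2", "c3"])

# A crash between processing and commit replays that record on restart.
log, committed, handled = ["e0", "e1", "e2"], {}, []
try:
    consume(log, committed, 0, handled, crash_before_commit_at=1)
except RuntimeError:
    pass
consume(log, committed, 0, handled)
```

In the last scenario, handled ends up as e0, e1, e1, e2: the record at offset 1 was processed but never committed, so the restarted consumer processes it again. That duplicate is what at-least-once delivery looks like in practice, and it is why downstream writes are usually made idempotent.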
Kafka in cloud data engineering
Cloud-managed Kafka equivalents:
Azure: Azure Event Hubs (Kafka-compatible endpoint; existing Kafka client code generally works after a configuration change)
AWS: Amazon Kinesis (different API) or Amazon MSK (managed Kafka)
GCP: Google Pub/Sub (different API) or Confluent Cloud (third-party managed Kafka, also available on AWS and Azure)
For learning: understanding Kafka concepts prepares you for all of them. MSK runs Apache Kafka itself, so the partition, consumer group, and offset model is identical there, and Event Hubs exposes the same model through its Kafka endpoint.
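As a rough illustration of what moving between a self-managed cluster and a Kafka-compatible service can look like, the sketch below builds two client configurations as plain dictionaries. The helper kafka_client_config, the host names, the group id, and the placeholder connection string are all hypothetical. The key names follow the style used by librdkafka-based clients such as confluent-kafka. The SASL settings follow the pattern Event Hubs documents for its Kafka endpoint, so check the current Azure documentation before relying on them.

```python
def kafka_client_config(bootstrap_servers, sasl_username=None, sasl_password=None):
    """Build a consumer config dict (hypothetical helper; imports no Kafka client)."""
    config = {
        "bootstrap.servers": bootstrap_servers,
        "group.id": "clickstream-dashboard",  # hypothetical consumer group name
        "auto.offset.reset": "earliest",      # start from the oldest retained event
    }
    if sasl_username is not None:
        config.update({
            "security.protocol": "SASL_SSL",
            "sasl.mechanism": "PLAIN",
            "sasl.username": sasl_username,
            "sasl.password": sasl_password,
        })
    return config


# Self-managed cluster on a private network (hypothetical host, no auth).
self_managed = kafka_client_config("broker1.example.internal:9092")

# Event Hubs Kafka endpoint: the namespace host on port 9093, with the
# connection string passed as the SASL password (placeholder value shown).
event_hubs = kafka_client_config(
    "my-namespace.servicebus.windows.net:9093",
    sasl_username="$ConnectionString",
    sasl_password="<event-hubs-connection-string>",
)

# Only the connection and auth settings differ between the two configs.
changed_keys = sorted(
    key for key in event_hubs if event_hubs[key] != self_managed.get(key)
)
```

The consumer logic itself (subscribe, poll, commit offsets) would stay the same in both cases; only the connection and authentication settings change.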