Azure Event Hubs
Event Hubs is Azure's managed event streaming service — the Azure-native equivalent of Apache Kafka. It ingests millions of events per second from applications, IoT devices, and microservices, then makes them available for real-time processing or batch archiving.
What is Azure Event Hubs?
Event Hubs is a fully managed, real-time data ingestion service. Producers send events (clicks, transactions, sensor readings, logs) into Event Hubs. Consumers read those events and process them — either in real time or in batches after they have been archived to storage.
The key thing to understand about Event Hubs: it is Kafka-compatible. You can use the Kafka protocol to produce and consume events without changing your application code. If you know Kafka, you know Event Hubs.
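To see what Kafka compatibility means in practice, here is a sketch of the configuration a standard Kafka client would use to talk to an Event Hubs namespace. The namespace name and connection string are placeholders; Event Hubs exposes its Kafka endpoint on port 9093 over SASL_SSL, with the literal string `$ConnectionString` as the username and the namespace connection string as the password:

```python
# Pointing a Kafka client at Event Hubs (hypothetical namespace).
# Event Hubs listens on port 9093 and authenticates with SASL PLAIN:
# username is the literal "$ConnectionString", password is the connection string.
def kafka_config_for_event_hubs(namespace: str, connection_string: str) -> dict:
    return {
        "bootstrap_servers": f"{namespace}.servicebus.windows.net:9093",
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "PLAIN",
        "sasl_plain_username": "$ConnectionString",
        "sasl_plain_password": connection_string,
    }

config = kafka_config_for_event_hubs("yournamespace", "Endpoint=sb://...")
print(config["bootstrap_servers"])
# With kafka-python, a producer would then be created as:
#   from kafka import KafkaProducer
#   producer = KafkaProducer(**config)
#   producer.send("sales-events", b'{"event_type": "purchase"}')
```

No topic or application code changes: only the bootstrap server and authentication settings differ from a plain Kafka deployment.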
Core Concepts
- **Namespace**: The top-level container for Event Hubs — like a server. One namespace can contain multiple Event Hubs, and the namespace holds the connection string.
- **Event Hub**: A single stream inside a namespace. Equivalent to a Kafka topic. One Event Hub typically represents one data source (e.g. clickstream, transactions).
- **Partition**: Events are distributed across partitions for parallelism. Each partition is an ordered, immutable sequence of events. More partitions = more throughput.
- **Consumer group**: A view of the entire Event Hub for one application. Multiple consumer groups read the same events independently — a streaming pipeline and an ML model can both read the same stream.
- **Checkpoint**: A saved position in the stream. When a consumer restarts, it resumes from the last checkpoint instead of reprocessing everything from the start.
- **Capture**: The auto-archive feature. It automatically writes incoming events to ADLS Gen2 or Blob Storage as Avro files on a time/size schedule — no consumer code needed.
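To make partition routing concrete, here is an illustrative sketch of how a partition key keeps related events together. The hash function below is not Event Hubs' actual internal algorithm; the point is the guarantee it demonstrates, namely that identical keys always land on the same partition, preserving per-key ordering:

```python
import hashlib

def pick_partition(partition_key: str, partition_count: int) -> int:
    """Illustrative only: a stable hash mapped onto the partition count.
    Event Hubs uses its own internal hash, but the guarantee is the same:
    the same key always routes to the same partition."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

partitions = 4
for key in ["CUST-100", "CUST-200", "CUST-100"]:
    print(key, "->", pick_partition(key, partitions))
# Both "CUST-100" events map to the same partition number,
# so all events for that customer stay in order.
```

Events sent without a partition key are distributed round-robin across partitions, which maximizes throughput but gives up per-key ordering.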
Sending Events — Producer Code
Producers send events in batches for efficiency. Each event is a JSON payload (or any binary format) representing something that happened — a sale, a click, a sensor reading.
```python
# Send events to Event Hubs using the Python SDK
# pip install azure-eventhub
import asyncio
import json
from datetime import datetime, timezone

from azure.eventhub import EventData
from azure.eventhub.aio import EventHubProducerClient

CONNECTION_STRING = "Endpoint=sb://yournamespace.servicebus.windows.net/;..."
EVENT_HUB_NAME = "sales-events"

async def send_events():
    producer = EventHubProducerClient.from_connection_string(
        conn_str=CONNECTION_STRING,
        eventhub_name=EVENT_HUB_NAME,
    )
    async with producer:
        # Create a batch — Event Hubs sends in batches for efficiency
        event_batch = await producer.create_batch()

        # Add events to the batch
        for i in range(10):
            payload = {
                "event_id": f"EVT-{i:04d}",
                "customer_id": f"CUST-{i * 100}",
                "amount": round(99.99 + i * 10, 2),
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "event_type": "purchase",
            }
            event_batch.add(EventData(json.dumps(payload)))

        # Send the whole batch in one network call
        await producer.send_batch(event_batch)
        print(f"Sent batch of {len(event_batch)} events")

asyncio.run(send_events())
```
Consuming Events — Consumer Code
Consumers read events and process them. The checkpoint store ensures that if your consumer crashes and restarts, it picks up exactly where it left off — no events are missed or reprocessed.
```python
# Read events from Event Hubs — consumer group pattern
# Multiple consumers can read the same stream independently
# pip install azure-eventhub azure-eventhub-checkpointstoreblob-aio
import asyncio
import json

from azure.eventhub.aio import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblobaio import BlobCheckpointStore

# CONNECTION_STRING and EVENT_HUB_NAME are the same values used in the
# producer example above.

# The checkpoint store tracks which events have been processed.
# If the consumer restarts, it resumes from the last checkpoint — not from the start.
CHECKPOINT_STORE = BlobCheckpointStore.from_connection_string(
    conn_str="DefaultEndpointsProtocol=https;AccountName=yourstorage;...",
    container_name="eventhub-checkpoints",
)

async def on_event(partition_context, event):
    # Process the event
    payload = json.loads(event.body_as_str())
    print(f"Partition {partition_context.partition_id}: {payload['event_id']}")

    # Write to a database, trigger downstream processing, etc.
    # ...

    # Checkpoint — marks this event as processed
    await partition_context.update_checkpoint(event)

async def consume():
    client = EventHubConsumerClient.from_connection_string(
        conn_str=CONNECTION_STRING,
        consumer_group="$Default",  # or a named consumer group
        eventhub_name=EVENT_HUB_NAME,
        checkpoint_store=CHECKPOINT_STORE,  # enables resume on restart
    )
    async with client:
        await client.receive(
            on_event=on_event,
            starting_position="-1",  # "-1" = read from the beginning of the stream
        )

asyncio.run(consume())
```
Event Hubs Capture — Archive to ADLS Gen2
Capture is the simplest way to get streaming data into your data lake. Enable it in the Azure Portal and Event Hubs automatically writes incoming events to ADLS Gen2 as Avro files — no consumer code needed. Then Databricks reads the archived files for batch processing.
```python
# Event Hubs Capture — automatically archive events to ADLS Gen2
# Configure in Azure Portal: Event Hub → Features → Capture
#
# Capture writes Avro files to your storage account in this path format:
#   {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}.avro
# Example file path:
#   myNamespace/sales-events/0/2025/03/15/14/30/00.avro

# Read the captured Avro files in Databricks:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("ReadEventHubCapture").getOrCreate()

# Read all captured Avro files for today
df = spark.read.format("avro").load(
    "abfss://capture@yourstorage.dfs.core.windows.net/myNamespace/sales-events/*/2025/03/15/*/*/*.avro"
)

# Event Hubs Capture wraps your payload in a binary Body column —
# cast it to string, then parse the JSON with an explicit schema
schema = StructType([
    StructField("event_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", StringType()),
    StructField("event_type", StringType()),
])
df_parsed = df.withColumn(
    "payload",
    F.from_json(F.col("Body").cast("string"), schema),
)
df_parsed.select("payload.*").show()
```
Pricing Tiers
| Tier | Throughput | Retention | Partitions | Use Case |
|---|---|---|---|---|
| Basic | 1 MB/s in, 2 MB/s out | 1 day | Up to 32 | Dev/test only |
| Standard | 1 MB/s per TU | 7 days | Up to 32 | Most production workloads |
| Premium | Processing units | 90 days | Up to 100 | High-throughput, low latency |
| Dedicated | Dedicated cluster | 90 days | Up to 1024 | Enterprise, petabyte scale |
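Standard-tier capacity is worked out from throughput units (TUs): each TU buys roughly 1 MB/s (or 1,000 events/s) of ingress and 2 MB/s of egress, whichever limit is hit first. A back-of-the-envelope sizing helper, assuming those published per-TU limits:

```python
import math

def required_throughput_units(ingress_mb_per_s: float, events_per_s: float) -> int:
    """Standard tier: 1 TU covers 1 MB/s ingress OR 1,000 events/s,
    whichever limit is reached first."""
    by_bytes = math.ceil(ingress_mb_per_s / 1.0)    # 1 MB/s per TU
    by_events = math.ceil(events_per_s / 1000.0)    # 1,000 events/s per TU
    return max(by_bytes, by_events, 1)

# 2,500 events/s at ~2 KB each is about 5 MB/s of ingress,
# so the byte limit dominates and 5 TUs are needed:
print(required_throughput_units(ingress_mb_per_s=5.0, events_per_s=2500))
```

Many small events can hit the events/s limit before the MB/s limit, so check both dimensions when sizing.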
Event Hubs in a Real Pipeline
The canonical Azure streaming pipeline: producers (applications, IoT devices, microservices) send events into Event Hubs; a real-time consumer such as Stream Analytics or Databricks Structured Streaming processes the stream as it arrives, while Capture archives the raw events to ADLS Gen2 for batch processing and replay.
🎯 Key Takeaways
- ✓ Event Hubs is Kafka-compatible — the same producer/consumer code works with both
- ✓ Partitions control throughput — more partitions enable more parallel consumers
- ✓ Consumer groups allow multiple independent readers on the same stream
- ✓ Capture archives events to ADLS Gen2 automatically — no consumer code needed for batch use cases
- ✓ Checkpointing ensures consumers resume from the right position after restart
- ✓ Standard tier handles most production DE workloads — start there