
Azure Event Hubs

Event Hubs is Azure's managed event streaming service — the Azure-native equivalent of Apache Kafka. It ingests millions of events per second from applications, IoT devices, and microservices, then makes them available for real-time processing or batch archiving.

14 min read · March 2026

What is Azure Event Hubs?

Event Hubs is a fully managed, real-time data ingestion service. Producers send events (clicks, transactions, sensor readings, logs) into Event Hubs. Consumers read those events and process them — either in real time or in batches after they have been archived to storage.

The key thing to understand about Event Hubs: it is Kafka-compatible. On Standard tier and above, the namespace exposes a Kafka endpoint, so existing Kafka producers and consumers can connect with only configuration changes — no application code changes. If you know Kafka, you already know most of Event Hubs.
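To make the compatibility claim concrete, here is a minimal sketch of producing over the Kafka protocol — assuming the third-party kafka-python package (`pip install kafka-python`) and a Standard-tier or higher namespace; the namespace name and connection string below are placeholders:

```python
# Producing to Event Hubs with an unmodified Kafka client.
# The Kafka endpoint listens on port 9093 and authenticates via SASL PLAIN,
# with the literal username "$ConnectionString" and the Event Hubs
# connection string as the password.
import json

def build_kafka_config(namespace: str, connection_string: str) -> dict:
    """Build the client config for the Event Hubs Kafka endpoint."""
    return {
        "bootstrap_servers": f"{namespace}.servicebus.windows.net:9093",
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "PLAIN",
        "sasl_plain_username": "$ConnectionString",
        "sasl_plain_password": connection_string,
    }

def send_sample(namespace: str, connection_string: str) -> None:
    from kafka import KafkaProducer  # pip install kafka-python
    producer = KafkaProducer(
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        **build_kafka_config(namespace, connection_string),
    )
    # The Event Hub name is simply the Kafka topic name
    producer.send("sales-events", {"event_type": "purchase", "amount": 99.99})
    producer.flush()
```

The same configuration works for Kafka consumers, Spark's Kafka connector, and other Kafka-ecosystem tools pointed at the namespace.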

Event Hubs vs Azure Service Bus
Event Hubs is for high-throughput event streaming — millions of events, multiple consumers, replay. Service Bus is for reliable message delivery — one consumer, guaranteed delivery, dead-letter queues. Use Event Hubs for data pipelines. Use Service Bus for application-to-application messaging.

Core Concepts

Namespace

The top-level container for Event Hubs — analogous to a server or a Kafka cluster. One namespace can contain multiple Event Hubs. Shared access policies, and therefore connection strings, are defined at the namespace level (and can also be scoped to an individual Event Hub).

Event Hub

A single stream/topic inside a namespace. Equivalent to a Kafka topic. One Event Hub typically represents one data source (e.g. clickstream, transactions).

Partition

Events are distributed across partitions for parallelism. Each partition is an ordered, immutable sequence of events. More partitions allow more parallel consumers, though total throughput is still capped by your tier's throughput or processing units.
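Ordering is only guaranteed within a partition, so events that must stay in order should share a partition key. A sketch of that pattern, assuming the producer client from the example later in this article (the `customer_id` key field mirrors the sample payloads):

```python
# Per-key ordering: create_batch(partition_key=...) pins every event in
# that batch to a single partition, so one customer's events are consumed
# in the order they were sent.
import json
from collections import defaultdict

def group_by_partition_key(events: list, key_field: str = "customer_id") -> dict:
    """Group payloads so each group can be sent as one keyed batch."""
    groups = defaultdict(list)
    for event in events:
        groups[event[key_field]].append(event)
    return dict(groups)

async def send_keyed(producer, events: list) -> None:
    from azure.eventhub import EventData  # pip install azure-eventhub
    for key, group in group_by_partition_key(events).items():
        batch = await producer.create_batch(partition_key=key)
        for payload in group:
            batch.add(EventData(json.dumps(payload)))
        await producer.send_batch(batch)
```

Leaving the partition key out lets Event Hubs round-robin events across partitions, which maximizes throughput when ordering does not matter.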

Consumer Group

A view of the entire Event Hub for one application. Multiple consumer groups read the same events independently — a streaming pipeline and an ML model can both read the same stream.

Checkpoint

A saved position in the stream. When a consumer restarts, it resumes from the last checkpoint instead of reprocessing everything from the start.

Capture

Auto-archive feature. Automatically writes incoming events to ADLS Gen2 or Blob Storage as Avro files on a time/size schedule — no consumer code needed.

Sending Events — Producer Code

Producers send events in batches for efficiency. Each event is a JSON payload (or any binary format) representing something that happened — a sale, a click, a sensor reading.

event_producer.py
python
# Send events to Event Hubs using Python SDK
# pip install azure-eventhub

import asyncio
import json
from azure.eventhub.aio import EventHubProducerClient
from azure.eventhub import EventData
from datetime import datetime, timezone

CONNECTION_STRING = "Endpoint=sb://yournamespace.servicebus.windows.net/;..."
EVENT_HUB_NAME   = "sales-events"

async def send_events():
    producer = EventHubProducerClient.from_connection_string(
        conn_str=CONNECTION_STRING,
        eventhub_name=EVENT_HUB_NAME
    )

    async with producer:
        # Create a batch — Event Hubs sends in batches for efficiency
        event_batch = await producer.create_batch()

        # Add events to the batch
        for i in range(10):
            payload = {
                "event_id":   f"EVT-{i:04d}",
                "customer_id": f"CUST-{i * 100}",
                "amount":      round(99.99 + i * 10, 2),
                "timestamp":   datetime.now(timezone.utc).isoformat(),
                "event_type":  "purchase"
            }
            event_batch.add(EventData(json.dumps(payload)))

        # Send the whole batch in one network call
        await producer.send_batch(event_batch)
        print(f"Sent batch of {len(event_batch)} events")

asyncio.run(send_events())
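A batch has a size limit (about 1 MB on Standard tier), and `EventDataBatch.add` raises `ValueError` once an event would push the batch over it. A sketch of handling that for arbitrarily long event streams — `send_all` assumes the producer client from the example above, and `chunk_by_size` is an illustrative pure helper mirroring the same size budget:

```python
import json

def chunk_by_size(payloads: list, max_bytes: int) -> list:
    """Split serialized payloads into chunks that fit a byte budget,
    mirroring what EventDataBatch.add enforces. A payload larger than
    the budget still gets its own single-item chunk."""
    chunks, current, current_size = [], [], 0
    for payload in payloads:
        body = json.dumps(payload).encode("utf-8")
        if current and current_size + len(body) > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(body)
        current_size += len(body)
    return chunks + ([current] if current else [])

async def send_all(producer, payloads: list) -> None:
    from azure.eventhub import EventData  # pip install azure-eventhub
    batch = await producer.create_batch()
    for payload in payloads:
        event = EventData(json.dumps(payload))
        try:
            batch.add(event)
        except ValueError:                     # batch is full
            await producer.send_batch(batch)   # flush it
            batch = await producer.create_batch()
            batch.add(event)                   # retry in the fresh batch
    if len(batch) > 0:
        await producer.send_batch(batch)       # flush the remainder
```

The try/except pattern is the SDK-idiomatic approach; the pre-chunking helper is useful when you want to size batches before touching the network.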

Consuming Events — Consumer Code

Consumers read events and process them. The checkpoint store records how far each partition has been read, so a consumer that crashes and restarts resumes from its last checkpoint instead of the beginning. Note this gives at-least-once delivery: events received after the last checkpoint may be redelivered on restart, so downstream processing should be idempotent.

event_consumer.py
python
# Read events from Event Hubs — consumer group pattern
# Multiple consumers can read the same stream independently

import asyncio
import json
from azure.eventhub.aio import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblobaio import BlobCheckpointStore

CONNECTION_STRING = "Endpoint=sb://yournamespace.servicebus.windows.net/;..."
EVENT_HUB_NAME = "sales-events"

# Checkpoint store tracks which events have been processed
# If consumer restarts, it resumes from last checkpoint — not from the start
CHECKPOINT_STORE = BlobCheckpointStore.from_connection_string(
    conn_str="DefaultEndpointsProtocol=https;AccountName=yourstorage;...",
    container_name="eventhub-checkpoints"
)

async def on_event(partition_context, event):
    # Process the event
    payload = json.loads(event.body_as_str())
    print(f"Partition {partition_context.partition_id}: {payload['event_id']}")

    # Write to database, trigger downstream processing, etc.
    # ...

    # Checkpoint — marks this event as processed
    await partition_context.update_checkpoint(event)

async def consume():
    client = EventHubConsumerClient.from_connection_string(
        conn_str=CONNECTION_STRING,
        consumer_group="$Default",          # or a named consumer group
        eventhub_name=EVENT_HUB_NAME,
        checkpoint_store=CHECKPOINT_STORE,  # enables resume on restart
    )

    async with client:
        await client.receive(
            on_event=on_event,
            starting_position="-1",  # -1 = read from beginning of stream
        )

asyncio.run(consume())
Consumer Groups for Fan-Out
Create separate consumer groups for each downstream system. Your real-time dashboard reads from consumer group "dashboard". Your Databricks pipeline reads from consumer group "pipeline". Both get every event independently — neither blocks or affects the other.

Event Hubs Capture — Archive to ADLS Gen2

Capture is the simplest way to get streaming data into your data lake. Enable it in the Azure Portal and Event Hubs automatically writes incoming events to ADLS Gen2 as Avro files — no consumer code needed. Then Databricks reads the archived files for batch processing.

read_capture.py
python
# Event Hubs Capture — automatically archive events to ADLS Gen2
# Configure in Azure Portal: Event Hub → Features → Capture

# Capture writes Avro files to your storage account in this path format:
# {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}.avro

# Example file path:
# myNamespace/sales-events/0/2025/03/15/14/30/00.avro

# Read captured Avro files in Databricks:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("ReadEventHubCapture").getOrCreate()

# Read all captured Avro files for today
df = spark.read.format("avro").load(
    "abfss://capture@yourstorage.dfs.core.windows.net/myNamespace/sales-events/*/2025/03/15/*/*/*.avro"
)

# Schema of the JSON payload sent by event_producer.py
schema = StructType([
    StructField("event_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", StringType()),
    StructField("event_type", StringType()),
])

# Event Hubs Capture wraps your payload in a Body column (binary)
df_parsed = df.withColumn(
    "payload",
    F.from_json(F.col("Body").cast("string"), schema)
)

df_parsed.select("payload.*").show()
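Because Capture's directory layout is date-based, a small helper can build the day-level wildcard path instead of hand-writing it each run. A sketch (the function name and zero-padding assumption are illustrative; they follow the path format shown above):

```python
def capture_path(container_uri: str, namespace: str, event_hub: str,
                 year: int, month: int, day: int) -> str:
    """Build a glob matching one day of Capture output across all
    partitions, following the documented layout:
    {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}.avro
    """
    return (
        f"{container_uri}/{namespace}/{event_hub}/*/"
        f"{year}/{month:02d}/{day:02d}/*/*/*.avro"
    )
```

Used with the Databricks reader above, `spark.read.format("avro").load(capture_path(...))` keeps daily batch jobs parameterized by date rather than by hard-coded strings.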

Pricing Tiers

Tier      | Throughput             | Retention | Partitions | Use Case
----------|------------------------|-----------|------------|---------------------------
Basic     | 1 MB/s in, 2 MB/s out  | 1 day     | Up to 32   | Dev/test only
Standard  | 1 MB/s per TU          | 7 days    | Up to 32   | Most production workloads
Premium   | Processing units       | 90 days   | Up to 100  | High-throughput, low latency
Dedicated | Dedicated cluster      | 90 days   | Up to 1024 | Enterprise, petabyte scale
Start with Standard tier
For most production data engineering workloads, Standard tier with 1-2 Throughput Units is sufficient. Dedicated clusters are only needed at very high scale (10,000+ events/second sustained).
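As a back-of-the-envelope sizing check, the sketch below estimates Standard-tier TUs from your event rate. It assumes the default ingress quotas of 1 MB/s or 1,000 events per second per TU (whichever limit is hit first); the function name is illustrative:

```python
import math

def throughput_units_needed(events_per_sec: float, avg_event_kb: float) -> int:
    """Estimate Standard-tier TUs for a given ingress rate.
    One TU allows ~1 MB/s OR 1,000 events/s in, whichever binds first."""
    by_bytes = (events_per_sec * avg_event_kb) / 1024  # MB/s -> TUs
    by_count = events_per_sec / 1000                   # events/s -> TUs
    return max(1, math.ceil(max(by_bytes, by_count)))
```

For example, 2,000 events/s at 0.5 KB each is under 1 MB/s but over the per-TU event count, so the count limit binds and two TUs are needed.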

Event Hubs in a Real Pipeline

The canonical Azure streaming pipeline looks like this:

Application / IoT Device → Event Hubs → Stream Analytics / Databricks → ADLS Gen2 / Synapse

🎯 Key Takeaways

  • Event Hubs is Kafka-compatible — the same producer/consumer code works with both
  • Partitions control throughput — more partitions enable more parallel consumers
  • Consumer groups allow multiple independent readers on the same stream
  • Capture archives events to ADLS Gen2 automatically — no consumer code needed for batch use cases
  • Checkpointing ensures consumers resume from the right position after restart
  • Standard tier handles most production DE workloads — start there