Cloud Pub/Sub
Cloud Pub/Sub is GCP's real-time messaging service. It decouples event producers from consumers and scales to millions of messages per second, making it the streaming ingestion layer for most GCP data pipelines.
What Pub/Sub does and why it matters
Cloud Pub/Sub is GCP's equivalent of Azure Event Hubs and Amazon Kinesis. Its core job is decoupling — separating the services that produce events (your web app, mobile app, IoT device) from the services that consume and process them (Dataflow, BigQuery). Producers publish messages to a topic. Consumers subscribe to that topic and receive messages.
This decoupling means if your Dataflow processing job goes down for maintenance, messages queue up in Pub/Sub and are delivered when it comes back. No events are lost. Producers and consumers can scale independently.
Topic: where producers publish messages. Think of a topic as a named channel. Your app publishes sale events to the sales-events topic.
Subscription: where consumers receive messages. A topic can have multiple subscriptions — Dataflow and BigQuery can both subscribe to the same topic.
Acknowledgment: messages must be acknowledged after processing. Unacknowledged messages are redelivered. This guarantees at-least-once delivery.
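These three concepts fit together in a simple loop: publish to a topic, fan out a copy to every subscription, redeliver each copy until it is acknowledged. As a rough in-memory sketch of those semantics (this is not the real client library; all names here are illustrative):

```python
class FakeTopic:
    """Toy model of Pub/Sub semantics: fan-out to subscriptions,
    redelivery of unacknowledged messages. Illustrative only."""

    def __init__(self):
        self.subscriptions = {}   # subscription name -> {msg_id: data}
        self._next_id = 0

    def subscribe(self, name: str) -> None:
        self.subscriptions.setdefault(name, {})

    def publish(self, data) -> int:
        self._next_id += 1
        for outstanding in self.subscriptions.values():
            outstanding[self._next_id] = data   # every subscription gets its own copy
        return self._next_id

    def pull(self, name: str) -> dict:
        # Returns all outstanding (unacknowledged) messages; an unacked
        # message shows up again on every pull, i.e. it is redelivered.
        return dict(self.subscriptions[name])

    def ack(self, name: str, msg_id: int) -> None:
        self.subscriptions[name].pop(msg_id, None)
```

With `dataflow-sub` and `bigquery-sub` both subscribed, one publish puts a copy on each subscription, and each copy stays pending until that subscription acknowledges it. This is also why a consumer can go down without losing events: its copies simply accumulate unacked.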
Publishing events to Pub/Sub
```python
# Pub/Sub Publisher — send events to a topic
from google.cloud import pubsub_v1
import json
from datetime import datetime, timezone

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('your-project', 'sales-events')

def publish_sale_event(order_id: str, customer_id: str, amount: float, region: str):
    event = {
        'order_id': order_id,
        'customer_id': customer_id,
        'amount': amount,
        'region': region,
        'timestamp': datetime.now(timezone.utc).isoformat(),
    }
    # Messages must be bytes
    data = json.dumps(event).encode('utf-8')
    # Add attributes for server-side filtering
    future = publisher.publish(
        topic_path,
        data=data,
        region=region,  # Attribute for subscription filtering
        event_type='sale',
    )
    print(f"Published message ID: {future.result()}")

# Publish 100 sale events
for i in range(100):
    publish_sale_event(f'ORD-{i}', f'CUST-{i % 20}', 99.99 * i, 'US-EAST')
```

Subscribing and processing messages
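The `region` and `event_type` attributes attached by the publisher exist so that a subscription can filter server-side: a subscription created with the filter `attributes.region = "US-EAST"` only ever receives matching messages. A rough local sketch of that matching logic (simplified; real Pub/Sub filters support more operators, and the actual filtering happens on the server):

```python
def matches_filter(attributes: dict, key: str, expected: str) -> bool:
    """Simplified stand-in for a subscription filter of the form
    attributes.<key> = "<value>". Illustrative only."""
    return attributes.get(key) == expected

# Messages published with attributes, as in the publisher above
messages = [
    {'data': b'order-1', 'attributes': {'region': 'US-EAST', 'event_type': 'sale'}},
    {'data': b'order-2', 'attributes': {'region': 'EU-WEST', 'event_type': 'sale'}},
]

# A subscription filtered on attributes.region = "US-EAST" would only
# see the first message; the second never reaches the subscriber.
delivered = [m for m in messages if matches_filter(m['attributes'], 'region', 'US-EAST')]
```

Filtering happens per subscription, so one topic can feed a US-only consumer and an EU-only consumer from the same stream of events.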
```python
# Pub/Sub Subscriber — pull and process messages
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1
import json

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('your-project', 'sales-events-sub')

def process_message(message: pubsub_v1.subscriber.message.Message):
    data = json.loads(message.data.decode('utf-8'))
    print(f"Processing order {data['order_id']} — ${data['amount']} from {data['region']}")
    # Your processing logic here
    # save_to_database(data)
    # IMPORTANT: Always acknowledge the message after processing
    # Unacknowledged messages are redelivered
    message.ack()

# Pull messages with streaming pull (most efficient)
streaming_pull_future = subscriber.subscribe(
    subscription_path,
    callback=process_message,
)
print(f"Listening for messages on {subscription_path}")

with subscriber:
    try:
        streaming_pull_future.result(timeout=60)
    except TimeoutError:
        streaming_pull_future.cancel()   # Trigger the shutdown
        streaming_pull_future.result()   # Block until the shutdown completes
```

🎯 Key Takeaways
- ✓ Pub/Sub decouples event producers from consumers — equivalent to Azure Event Hubs and Amazon Kinesis
- ✓ Topics are where producers publish. Subscriptions are where consumers receive. One topic can have many subscriptions
- ✓ Always acknowledge messages after processing — unacknowledged messages are redelivered automatically
- ✓ The most common pattern: app publishes to Pub/Sub → Dataflow reads from Pub/Sub → writes to BigQuery
- ✓ Pub/Sub guarantees at-least-once delivery — design your processing logic to handle duplicate messages safely
- ✓ Messages are retained for 7 days by default — useful for replaying events if a consumer goes down
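Because delivery is at-least-once, the same message can arrive twice (for example, if an ack is lost in transit). A common defense is to make processing idempotent by tracking already-seen identifiers. A minimal sketch, with an in-memory set standing in for what would be Redis or a database unique constraint in production (`handle_once` and the dedup key choice are illustrative, not part of the Pub/Sub API):

```python
import json

processed_ids = set()  # in production: Redis, or a DB unique constraint

def handle_once(message_data: bytes) -> bool:
    """Process a sale event at most once per order_id.
    Returns True if processed, False if skipped as a duplicate."""
    event = json.loads(message_data.decode('utf-8'))
    if event['order_id'] in processed_ids:
        return False  # duplicate delivery: safe to ack and drop
    processed_ids.add(event['order_id'])
    # ... real processing (e.g. write to the database) goes here ...
    return True
```

With this in place, a redelivered message is acked and dropped instead of being written twice, which is exactly the "handle duplicates safely" property the at-least-once guarantee demands.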