
Cloud Dataflow

Cloud Dataflow is GCP's fully managed Apache Beam service. It handles both batch and streaming data processing, auto-scales workers up and down based on workload, and integrates natively with BigQuery, Pub/Sub, and GCS.


Dataflow vs. Databricks and Glue

Cloud Dataflow is GCP's answer to Azure Databricks and AWS Glue. All three are managed data processing services, but the engines differ: Databricks and Glue run Apache Spark, while Dataflow is built on Apache Beam, an open-source unified programming model for both batch and streaming pipelines. You write one pipeline in Python or Java and run it as either batch (reading files) or streaming (reading from Pub/Sub) by swapping the source transform and enabling streaming mode, as the sketch below shows.
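
Here is a minimal sketch of that swap, not a production pipeline: the project, bucket, and Pub/Sub subscription names are placeholders, and the shared transform is a stand-in for whatever parsing and enrichment your pipeline actually does.

batch_vs_streaming_source.py
python
# Sketch: one set of transforms, two interchangeable sources.
# The project, bucket, and subscription names below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def shared_transforms(pcoll):
    # Stand-in for your real parse/validate/enrich steps (identical in both modes)
    return pcoll | 'Uppercase' >> beam.Map(lambda line: line.upper())

# Batch: bounded source, the job finishes once the files are consumed
batch_opts = PipelineOptions(['--runner=DataflowRunner',
                              '--project=your-project',
                              '--region=us-central1',
                              '--temp_location=gs://your-bucket/temp'])
with beam.Pipeline(options=batch_opts) as p:
    lines = p | 'Read GCS files' >> beam.io.ReadFromText(
        'gs://your-bucket/bronze/sales/*.csv', skip_header_lines=1)
    shared_transforms(lines) | 'Print batch' >> beam.Map(print)

# Streaming: unbounded source; add --streaming and the job runs until cancelled
stream_opts = PipelineOptions(['--runner=DataflowRunner',
                               '--project=your-project',
                               '--region=us-central1',
                               '--temp_location=gs://your-bucket/temp',
                               '--streaming'])
with beam.Pipeline(options=stream_opts) as p:
    events = (p
              | 'Read Pub/Sub' >> beam.io.ReadFromPubSub(
                    subscription='projects/your-project/subscriptions/sales-sub')
              | 'Decode bytes' >> beam.Map(lambda msg: msg.decode('utf-8')))
    shared_transforms(events) | 'Print stream' >> beam.Map(print)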

Dataflow is also truly serverless in the GCP sense — it auto-scales workers up and down mid-job based on throughput, and you pay only for the worker time consumed. No cluster to provision upfront, no idle costs.

Batch Processing

Reads files from GCS in bulk. Processes them in parallel across auto-scaled workers. Writes results to BigQuery or GCS. Roughly equivalent to running a Databricks notebook job triggered by Azure Data Factory (ADF).

Stream Processing

Reads live events from Pub/Sub in real time. Applies windowing and aggregations on the fly. Writes windowed results to BigQuery as each window closes. Roughly equivalent to Azure Stream Analytics.
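
As a rough sketch of what that can look like in Beam (the subscription, table, and field names are placeholders, and each Pub/Sub message is assumed to carry one JSON sales event):

streaming_revenue_pipeline.py
python
# Sketch: windowed streaming aggregation. Names, paths, and the event schema are placeholders.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows
from apache_beam.io.gcp.bigquery import WriteToBigQuery

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=your-project',
    '--region=us-central1',
    '--temp_location=gs://your-bucket/temp',
    '--streaming',                    # required for unbounded Pub/Sub sources
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'Read Pub/Sub'   >> beam.io.ReadFromPubSub(
           subscription='projects/your-project/subscriptions/sales-events')
     | 'Parse JSON'     >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
     | 'Key by region'  >> beam.Map(lambda e: (e['region'], e['quantity'] * e['unit_price']))
     | '1-min windows'  >> beam.WindowInto(FixedWindows(60))
     | 'Sum per region' >> beam.CombinePerKey(sum)
     | 'To table rows'  >> beam.Map(lambda kv: {'region': kv[0], 'revenue': kv[1]})
     | 'Write BigQuery' >> WriteToBigQuery(
           table='your_project:silver.revenue_per_minute',
           schema='region:STRING, revenue:FLOAT',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))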

Writing an Apache Beam pipeline for Dataflow

Beam pipelines use a functional, chained syntax with the pipe operator. Each step transforms a PCollection (parallel collection of data) and passes it to the next step. The pipeline definition is lazy — nothing runs until you execute it with a runner (DataflowRunner for GCP, DirectRunner for local testing).

dataflow_batch_pipeline.py
python
# Cloud Dataflow: Apache Beam pipeline — batch processing
# Reads from GCS, transforms, writes to BigQuery

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.bigquery import WriteToBigQuery

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=your-project',
    '--region=us-central1',
    '--temp_location=gs://your-bucket/temp',
    '--staging_location=gs://your-bucket/staging',
    '--job_name=bronze-to-silver-sales',
    '--max_num_workers=10',           # Auto-scales up to 10 workers
    '--autoscaling_algorithm=THROUGHPUT_BASED'
])

def parse_csv(line):
    # Naive comma split; fine for this sample data, use the csv module if fields can contain quoted commas
    fields = line.split(',')
    return {
        'order_id':       fields[0],
        'customer_id':    fields[1],
        'order_date':     fields[2],
        'product':        fields[3],
        'quantity':       int(fields[4]),
        'unit_price':     float(fields[5]),
        'region':         fields[6].strip().upper(),
    }

def compute_revenue(record):
    record['net_revenue'] = record['quantity'] * record['unit_price']
    return record

def is_valid(record):
    return (record['order_id'] and 
            record['quantity'] > 0 and 
            record['unit_price'] > 0)

# Define the pipeline using the | operator (Beam's pipeline syntax)
with beam.Pipeline(options=options) as pipeline:
    clean_records = (
        pipeline
        | 'Read from GCS Bronze'    >> beam.io.ReadFromText('gs://your-bucket/bronze/sales/2025/03/01/*.csv', skip_header_lines=1)
        | 'Parse CSV'               >> beam.Map(parse_csv)
        | 'Filter Invalid Records'  >> beam.Filter(is_valid)
        | 'Compute Revenue'         >> beam.Map(compute_revenue)
    )
    
    # Write to BigQuery Silver table
    clean_records | 'Write to BigQuery' >> WriteToBigQuery(
        table='your_project:silver.sales',
        schema=('order_id:STRING, customer_id:STRING, order_date:STRING, '
                'product:STRING, quantity:INTEGER, unit_price:FLOAT, '
                'region:STRING, net_revenue:FLOAT'),  # required for CREATE_IF_NEEDED; adjust types to your data
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    )

print("Dataflow job submitted successfully")
🎯 Pro Tip
Always test Beam pipelines locally with DirectRunner before submitting to Dataflow. A local run costs nothing and gives feedback in seconds: change the runner from DataflowRunner to DirectRunner and point the pipeline at a small local file. Once it works locally, switch back to DataflowRunner and submit.
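
For instance, a minimal local variant might look like this; the sample file and output paths are placeholders, and the three helper functions are assumed to be copied in from the batch pipeline above:

local_test_pipeline.py
python
# Local smoke test with DirectRunner; no GCP resources are touched.
# Assumes parse_csv, is_valid and compute_revenue are copied in from dataflow_batch_pipeline.py.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

local_options = PipelineOptions(['--runner=DirectRunner'])

with beam.Pipeline(options=local_options) as pipeline:
    (pipeline
     | 'Read local sample'  >> beam.io.ReadFromText('sample_sales.csv', skip_header_lines=1)
     | 'Parse CSV'          >> beam.Map(parse_csv)
     | 'Filter Invalid'     >> beam.Filter(is_valid)
     | 'Compute Revenue'    >> beam.Map(compute_revenue)
     | 'To JSON lines'      >> beam.Map(json.dumps)
     | 'Write local output' >> beam.io.WriteToText('output/silver_sales', file_name_suffix='.jsonl'))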

🎯 Key Takeaways

  • Dataflow is managed Apache Beam on GCP — handles both batch and streaming with the same pipeline code
  • Apache Beam uses a functional chained syntax with the pipe operator — read, transform, filter, write
  • Dataflow auto-scales workers up and down mid-job — no cluster sizing, no idle costs
  • DirectRunner runs Beam pipelines locally for fast testing — switch to DataflowRunner for production
  • Dataflow integrates natively with BigQuery, Pub/Sub, and GCS — the core GCP data engineering services
  • One Beam pipeline can run as batch (reading GCS files) or streaming (reading Pub/Sub) by swapping the source and enabling streaming mode; the transforms in between stay the same