AWS Step Functions
Step Functions is AWS's serverless workflow orchestration service. You define pipelines as state machines — visual, auditable, and automatically retried. Each step calls an AWS service: Lambda, Glue, Athena, EMR, or anything else in the AWS ecosystem.
What is AWS Step Functions?
Step Functions lets you coordinate multiple AWS services into a pipeline without writing the coordination logic in code. You define a state machine — a series of steps with branching, retries, and error handling — in JSON. AWS visualizes it, executes it, and records every step with full audit history.
For data engineers, Step Functions is most useful for orchestrating pipelines where each step calls a different AWS service: Lambda for validation, Glue for transformation, Athena for aggregation, SNS for alerting.
State Types
Task: Calls an AWS service — Lambda, Glue, EMR, Athena, SNS, SQS, DynamoDB. The core building block. Use the .sync suffix to wait for completion before moving on.
Choice: Branches based on conditions, like an if/else. Inspect the state input ($.variable) and route to different states. Choice states support no Retry or Catch.
Parallel: Runs multiple branches simultaneously and waits for all to complete before continuing. Use it for steps that do not depend on each other.
Map: Processes an array of items in parallel, like a forEach. Process 30 daily files simultaneously instead of one at a time.
Wait: Pauses execution for a fixed duration or until a timestamp. Useful for rate limiting and for coordinating with external schedules.
Succeed / Fail: Terminal states. Succeed ends the execution successfully; Fail ends it with an error name and a custom message for debugging.
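As a concrete illustration of the Map state, the fragment below builds one as a plain Python dict and serializes it to ASL JSON. The state and function names (ProcessDailyFiles, ProcessFile) are illustrative placeholders, not from a real deployment:

```python
import json

# A hypothetical Map state: runs "ProcessOneFile" once per element of $.files,
# at most 10 files at a time. All names here are illustrative.
map_state = {
    "ProcessDailyFiles": {
        "Type": "Map",
        "ItemsPath": "$.files",   # the array in the state input to iterate over
        "MaxConcurrency": 10,     # cap parallelism (0 = unlimited)
        "Iterator": {
            "StartAt": "ProcessOneFile",
            "States": {
                "ProcessOneFile": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:123456:function:ProcessFile",
                    "End": True
                }
            }
        },
        "Next": "NotifySuccess"
    }
}

asl_fragment = json.dumps(map_state, indent=2)
print(asl_fragment)
```

If the input contains `"files": ["a.csv", "b.csv", ...]`, Step Functions runs the iterator once per element and collects the results into an output array.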
A Complete Data Pipeline State Machine
This state machine validates an S3 file, runs a Glue transformation, checks data quality, runs a Gold aggregation in Athena, then sends a success or failure notification. Every step has retry logic and error handling built in.
// Step Functions State Machine — defined in Amazon States Language (ASL)
// This pipeline: validate file → run Glue job → check quality → notify
// (ASL is plain JSON — remove these // annotations before deploying)
{
  "Comment": "Daily Sales Data Pipeline",
  "StartAt": "ValidateInputFile",
  "States": {
    "ValidateInputFile": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456:function:ValidateS3File",
      "Parameters": {
        "bucket.$": "$.bucket",
        "key.$": "$.key"
      },
      "Next": "RunGlueTransform",
      "Catch": [{
        "ErrorEquals": ["FileNotFoundError"],
        "ResultPath": "$.error", // keep the error info at $.error without discarding the input
        "Next": "NotifyFailure"
      }]
    },
    "RunGlueTransform": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync", // .sync = wait for completion
      "Parameters": {
        "JobName": "sales-bronze-to-silver",
        "Arguments": {
          "--run_date.$": "$.run_date",
          "--input_path.$": "$.input_path"
        }
      },
      "Next": "CheckDataQuality",
      "Retry": [{
        "ErrorEquals": ["Glue.AWSGlueException"],
        "IntervalSeconds": 60,
        "MaxAttempts": 2,
        "BackoffRate": 2
      }]
    },
    "CheckDataQuality": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456:function:CheckDataQuality",
      "Next": "QualityPassed?"
    },
    "QualityPassed?": {
      "Type": "Choice",
      "Choices": [{
        "Variable": "$.quality_score",
        "NumericGreaterThan": 0.95,
        "Next": "RunGoldAggregation"
      }],
      "Default": "NotifyQualityFailure"
    },
    "RunGoldAggregation": {
      "Type": "Task",
      "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
      "Parameters": {
        "QueryString": "INSERT INTO gold_daily_revenue SELECT ...",
        "WorkGroup": "pipeline-workgroup"
      },
      "Next": "NotifySuccess"
    },
    "NotifySuccess": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456:pipeline-alerts",
        "Message": "Daily sales pipeline completed successfully"
      },
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456:pipeline-alerts",
        "Message.$": "States.Format('Pipeline failed: {}', $.error.Cause)"
      },
      "End": true
    },
    "NotifyQualityFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456:pipeline-alerts",
        "Message.$": "States.Format('Quality check failed. Score: {}', $.quality_score)"
      },
      "End": true
    }
  }
}

Triggering Executions from Python
# Trigger a Step Functions execution from Python
import json
import time
from datetime import datetime

import boto3

sf = boto3.client('stepfunctions', region_name='us-east-1')

# Start pipeline execution
response = sf.start_execution(
    stateMachineArn='arn:aws:states:us-east-1:123456:stateMachine:SalesPipeline',
    name=f"run-{datetime.utcnow().strftime('%Y%m%d-%H%M%S')}",  # unique execution name
    input=json.dumps({
        "run_date": "2025-03-15",
        "bucket": "your-data-bucket",
        "key": "bronze/sales/2025/03/15/sales.csv",
        "input_path": "s3://your-data-bucket/bronze/sales/2025/03/15/"
    })
)
execution_arn = response['executionArn']
print(f"Started execution: {execution_arn}")

# Poll for completion (optional — Step Functions runs async)
while True:
    status = sf.describe_execution(executionArn=execution_arn)
    if status['status'] in ('SUCCEEDED', 'FAILED', 'TIMED_OUT', 'ABORTED'):
        print(f"Execution {status['status']}")
        break
    print(f"Running... status: {status['status']}")
    time.sleep(10)

Error Handling — Retry and Catch
Retry: Automatically retries a failed task with exponential backoff. Define which error types to retry, how many attempts to make, and how long to wait between them. Essential for transient failures like Glue job timeouts.
Catch: Routes to a different state when a specific error occurs. A FileNotFoundError goes to NotifyFailure; a Glue timeout triggers a retry. Catch handles errors that cannot be resolved by retrying.
TimeoutSeconds: Sets a maximum time a state can run before Step Functions marks it as failed. Prevents a stuck Glue job from holding up the pipeline indefinitely.
HeartbeatSeconds: For long-running tasks, requires periodic heartbeat signals. If the task stops sending heartbeats (e.g. it crashes silently), Step Functions detects it and can retry or fail the state.
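To see how a Retry policy's parameters interact, here is a small sketch that computes the wait before each attempt from IntervalSeconds, BackoffRate, and MaxAttempts, using the same numbers as the Glue retry block earlier in this section:

```python
# Wait before the n-th retry of a Step Functions Retry policy:
# wait(n) = IntervalSeconds * BackoffRate ** (n - 1)
def retry_schedule(interval_seconds, backoff_rate, max_attempts):
    return [interval_seconds * backoff_rate ** (n - 1)
            for n in range(1, max_attempts + 1)]

# The Glue retry block: IntervalSeconds=60, BackoffRate=2, MaxAttempts=2
waits = retry_schedule(60, 2, 2)
print(waits)  # prints [60, 120]: first retry after 60s, second after a further 120s
```

So the Glue job gets up to two retries over roughly three minutes before the error propagates (and, if no Catch matches, fails the execution).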
🎯 Key Takeaways
- ✓ Step Functions orchestrates AWS services into pipelines using visual state machines
- ✓ State types: Task (call AWS service), Choice (branch), Parallel (concurrent), Map (forEach)
- ✓ Use the .sync resource suffix to wait for Glue, EMR, and Athena jobs to complete before moving on
- ✓ Built-in Retry with exponential backoff handles transient failures automatically
- ✓ Catch routes different error types to different recovery paths
- ✓ Use EventBridge to trigger Step Functions on a cron schedule — no always-running scheduler needed
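The EventBridge takeaway can be sketched concretely. The payloads below are what you would pass to boto3's events.put_rule and events.put_targets to start the pipeline at 06:00 UTC every day; the rule name, role ARN, account ID, and input fields are placeholders for your own values:

```python
import json

# Placeholder names and ARNs; the role must allow states:StartExecution.
rule_params = {
    "Name": "daily-sales-pipeline-trigger",
    "ScheduleExpression": "cron(0 6 * * ? *)",  # 06:00 UTC daily
    "State": "ENABLED",
}
targets_params = {
    "Rule": "daily-sales-pipeline-trigger",
    "Targets": [{
        "Id": "sales-pipeline",
        "Arn": "arn:aws:states:us-east-1:123456:stateMachine:SalesPipeline",
        "RoleArn": "arn:aws:iam::123456:role/EventBridgeInvokeStepFunctions",
        "Input": json.dumps({"bucket": "your-data-bucket",
                             "key": "bronze/sales/latest/sales.csv"}),
    }],
}

# With boto3 (requires AWS credentials):
#   events = boto3.client('events', region_name='us-east-1')
#   events.put_rule(**rule_params)
#   events.put_targets(**targets_params)
print(rule_params["ScheduleExpression"])
```

Each scheduled firing starts a fresh execution with the given Input, so per-run values like run_date are usually derived inside the first state rather than hardcoded in the rule.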