AWS Glue — Serverless Spark on AWS
AWS Glue is the closest thing AWS has to Azure Databricks: a managed environment for running Spark-based data transformations without managing servers. The key difference is that Glue is fully serverless, while Databricks gives you more control.
What Glue actually does
Glue runs your Spark transformation code without you managing any servers. You write a Python or Scala script, upload it, tell Glue how many DPUs (data processing units) to use, and it spins up a Spark environment, runs your code, and shuts down.
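A job run can be started from the console, a trigger, or the API. As a minimal sketch, the hypothetical helper `build_job_run_request` below assembles the keyword arguments for boto3's `glue.start_job_run` call. The job name is illustrative, and the `G.1X` worker type (one DPU per worker) is an assumption about how capacity is specified; older jobs may use a different capacity setting.

```python
from datetime import date


def build_job_run_request(job_name: str, run_date: date, workers: int = 2) -> dict:
    """Build keyword arguments for boto3's glue.start_job_run()."""
    return {
        "JobName": job_name,
        # Job parameters are passed with a leading "--"
        "Arguments": {"--run_date": run_date.isoformat()},
        "WorkerType": "G.1X",        # assumption: one G.1X worker is 1 DPU
        "NumberOfWorkers": workers,
    }


# "sales-bronze-to-silver" is a hypothetical job name
request = build_job_run_request("sales-bronze-to-silver", date(2025, 1, 15), workers=4)
print(request)

# With AWS credentials configured, the request would be submitted like this:
# import boto3
# glue = boto3.client("glue")
# run_id = glue.start_job_run(**request)["JobRunId"]
```

Keeping the request as a plain dictionary makes it easy to log or unit-test before anything is sent to AWS.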
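You pay only for the time the job runs, billed per DPU-hour (roughly $0.44 per DPU-hour in US regions at the time of writing). A 10-minute Glue job costs roughly $0.44 on 6 DPUs and about $0.73 on 10. You do not pay for idle time the way you do with a running Databricks cluster.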
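The cost of a run depends on how many DPUs it uses as well as how long it runs. The sketch below works through the arithmetic. The $0.44 per DPU-hour rate and the one-minute billing minimum are assumptions based on published pricing for US regions and may differ by region, Glue version, or date, so check the current pricing page before relying on them.

```python
def glue_job_cost(
    dpus: float,
    runtime_seconds: float,
    price_per_dpu_hour: float = 0.44,   # assumed rate, varies by region
    minimum_seconds: float = 60.0,      # assumed billing minimum
) -> float:
    """Estimate the cost in USD of one Glue job run."""
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * (billed_seconds / 3600.0) * price_per_dpu_hour


# A 10-minute job at three different capacities
for dpus in (2, 6, 10):
    print(f"{dpus} DPUs x 10 min: ${glue_job_cost(dpus, 600):.2f}")
```

Under these assumptions a 10-minute run costs about $0.15 on 2 DPUs, $0.44 on 6 DPUs, and $0.73 on 10 DPUs.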
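The tradeoff: every job run pays a cold start (typically 30-60 seconds), Glue has less flexibility than Databricks, and debugging a job is harder because a job run is a batch execution, not a set of cells you can run interactively. Glue interactive sessions and notebooks narrow this gap during development, but they are separate from scheduled job runs. For scheduled batch jobs Glue works very well.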
Two things Glue does — know both
Glue has two separate functions that people often confuse.
Glue ETL Jobs — Spark jobs that you write and Glue runs. This is the transformation engine. You write PySpark, Glue executes it.
Glue Data Catalog — a metadata store. It records what tables exist, where they live in S3, and what their schema is. Athena, Redshift Spectrum, and EMR all use the Glue Catalog to know what data exists and where.
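You will use both in most AWS data engineering projects. They work together: ETL jobs write the data, and the Catalog tells query engines where that data lives and what it looks like.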
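To make the Catalog's role concrete, here is a minimal sketch of what a consumer reads from it. The hypothetical helper `describe_table` takes a response in the shape returned by boto3's `glue.get_table` and pulls out the S3 location, columns, and partition keys. The sample response is hand-written for illustration, not real output.

```python
def describe_table(get_table_response: dict) -> dict:
    """Extract location, schema, and partition keys from a glue.get_table() response."""
    table = get_table_response["Table"]
    storage = table["StorageDescriptor"]
    return {
        "name": f'{table["DatabaseName"]}.{table["Name"]}',
        "location": storage["Location"],
        "columns": {col["Name"]: col["Type"] for col in storage["Columns"]},
        "partition_keys": [key["Name"] for key in table.get("PartitionKeys", [])],
    }


# Illustrative response; a real one comes from
# boto3.client("glue").get_table(DatabaseName="silver_db", Name="orders")
sample_response = {
    "Table": {
        "Name": "orders",
        "DatabaseName": "silver_db",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://company-silver/orders/",
        },
        "PartitionKeys": [{"Name": "order_date", "Type": "date"}],
    }
}

info = describe_table(sample_response)
print(info["name"], "->", info["location"])
print(info["columns"], info["partition_keys"])
```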
Writing a Glue ETL job
Glue provides its own wrapper classes (GlueContext, DynamicFrame) on top of Spark. Most engineers use plain PySpark inside Glue — it works fine and is simpler.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Glue boilerplate: every job starts with this
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'run_date'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

run_date = args['run_date']  # passed as --run_date from the Glue trigger or Step Functions

# Overwrite only the partitions this run writes, not the whole Silver table
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Read raw data from Bronze S3
df = spark.read.option("header", "true").csv(
    f"s3://company-bronze/sales/date={run_date}/"
)

# Clean and transform (same as you would in Databricks)
df = df.dropna(subset=["order_id", "customer_id", "amount"])
df = df.dropDuplicates(["order_id"])
df = df.withColumn("amount", F.col("amount").cast("double"))
df = df.filter(F.col("amount") > 0)  # also drops rows where the cast produced null
df = df.withColumn("order_date", F.to_date(F.col("order_date"), "yyyy-MM-dd"))
df = df.withColumn("gross_amount", F.col("quantity").cast("int") * F.col("amount"))
df = df.withColumn("processed_at", F.current_timestamp())
df = df.cache()  # reused by both the write and the count below

# Write to Silver as Parquet with partitioning
(
    df.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3://company-silver/orders/")
)

print(f"Processed {df.count()} records for {run_date}")

# Always commit the job at the end
job.commit()

The Glue Data Catalog: making your data queryable
Once your Glue job writes data to S3, Athena cannot query it until there is a table definition in the Catalog. You create this either by running a Glue Crawler or by defining the table manually.
import boto3

glue = boto3.client('glue', region_name='us-east-1')

# Create a database first (like a schema in SQL)
glue.create_database(
    DatabaseInput={'Name': 'silver_db', 'Description': 'Cleaned data (Silver layer)'}
)

# Create a table definition pointing to your S3 data
glue.create_table(
    DatabaseName='silver_db',
    TableInput={
        'Name': 'orders',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'order_id', 'Type': 'string'},
                {'Name': 'customer_id', 'Type': 'string'},
                {'Name': 'amount', 'Type': 'double'},
                {'Name': 'gross_amount', 'Type': 'double'},
                {'Name': 'status', 'Type': 'string'},
            ],
            'Location': 's3://company-silver/orders/',
            # Parquet data needs the Parquet input/output formats, not the text ones
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
            },
        },
        'PartitionKeys': [{'Name': 'order_date', 'Type': 'date'}],
        'TableType': 'EXTERNAL_TABLE',
    }
)
print("Table created in Glue Catalog")

Querying with Athena after Glue writes the data
Athena is AWS's serverless SQL query engine. It reads from S3 using the Glue Catalog as a schema registry. Once your table is registered in the Catalog, analysts can query it with standard SQL. You pay per terabyte of data scanned.
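One caveat for partitioned tables: Athena returns no rows for a partition until that partition is registered in the Catalog. A Glue Crawler or Athena's `MSCK REPAIR TABLE` statement can discover partitions for you. As an alternative, the sketch below builds the partition entries by hand in the shape expected by boto3's `glue.batch_create_partition`. The helper name `build_partition_input` and the trimmed-down storage descriptor are illustrative assumptions.

```python
import copy


def build_partition_input(table_storage: dict, order_date: str) -> dict:
    """Describe one order_date partition in the shape Glue's partition APIs expect."""
    storage = copy.deepcopy(table_storage)  # reuse the table's format and SerDe settings
    base = table_storage["Location"].rstrip("/")
    storage["Location"] = f"{base}/order_date={order_date}/"
    return {"Values": [order_date], "StorageDescriptor": storage}


# Trimmed-down stand-in for the StorageDescriptor of the orders table
table_storage = {
    "Columns": [{"Name": "order_id", "Type": "string"}],
    "Location": "s3://company-silver/orders/",
}

partitions = [
    build_partition_input(table_storage, d) for d in ("2025-01-14", "2025-01-15")
]
for p in partitions:
    print(p["Values"], p["StorageDescriptor"]["Location"])

# With AWS credentials configured, the partitions would be registered like this:
# import boto3
# glue = boto3.client("glue", region_name="us-east-1")
# glue.batch_create_partition(
#     DatabaseName="silver_db", TableName="orders", PartitionInputList=partitions
# )
```

Registering partitions explicitly at the end of each job run avoids rescanning the whole S3 prefix, which is what a repair or crawl has to do.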
-- Query the silver orders table from Athena
SELECT
    order_date,
    COUNT(*) AS total_orders,
    SUM(gross_amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM silver_db.orders
WHERE order_date >= DATE '2025-01-01'
GROUP BY order_date
ORDER BY order_date;
-- Partition pruning: Athena only scans partitions matching the WHERE clause.
-- This is why partitioning matters: without it, every query scans everything.

Glue vs Databricks: when to use which
If you are on AWS and your transformations are straightforward batch jobs that run on a schedule, Glue is fine and simpler to operate.
If you need interactive development, complex multi-step notebooks, Delta Lake with MERGE operations, or ML workflows in the same environment, use Databricks on AWS.
Many companies run both: Glue for simple scheduled ETL, Databricks for complex transformation logic and data science work.