Amazon EMR
Amazon EMR (Elastic MapReduce) is AWS's managed big data platform. It runs Apache Spark, Hadoop, Hive, and Presto on EC2 clusters. EMR is the lower-cost alternative to Databricks for teams that want Spark without the premium.
What is Amazon EMR?
EMR provisions and manages clusters of EC2 instances running big data frameworks — primarily Apache Spark. You define the cluster size, submit Spark jobs, and EMR handles provisioning, configuration, monitoring, and termination.
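That lifecycle can also be observed from code. Below is a hedged sketch using boto3's EMR waiters; the cluster ID is a placeholder, and the call is wrapped in a function so nothing runs without AWS credentials:

```python
# Sketch: block until an EMR cluster is up, then report its state.
# 'j-XXXXXXXXXXXXX' is a placeholder cluster ID.

def wait_for_cluster(cluster_id: str) -> str:
    import boto3  # imported here so the sketch can be read without AWS installed

    emr = boto3.client("emr", region_name="us-east-1")

    # Blocks until the cluster reaches RUNNING/WAITING (raises on failure states)
    waiter = emr.get_waiter("cluster_running")
    waiter.wait(ClusterId=cluster_id)

    # Read back the cluster's current state
    cluster = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]
    return cluster["Status"]["State"]

# Example (requires a live cluster):
# print(wait_for_cluster("j-XXXXXXXXXXXXX"))
```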
The core value of EMR over Databricks is cost. EMR charges the EC2 instance price plus a small EMR management fee. Databricks charges EC2 plus Databricks Unit (DBU) fees on top — often 2-3x more expensive for equivalent compute. For teams running large, long-running batch jobs, EMR saves significant money.
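To make the cost gap concrete, here is a back-of-envelope comparison. The prices are illustrative placeholders, not current list prices; always check the AWS and Databricks pricing pages:

```python
# Back-of-envelope hourly cost per worker node (illustrative numbers only).
ec2_per_hour = 0.384      # assumed on-demand price for an m5.2xlarge
emr_fee_per_hour = 0.096  # assumed EMR management fee for that instance
dbu_cost_per_hour = 0.60  # assumed DBU charge for an equivalent Databricks worker

emr_hourly = ec2_per_hour + emr_fee_per_hour
databricks_hourly = ec2_per_hour + dbu_cost_per_hour

# 10 workers running a 6-hour batch job every night for 30 nights
hours = 10 * 6 * 30
print(f"EMR:        ${emr_hourly * hours:,.2f}/month")
print(f"Databricks: ${databricks_hourly * hours:,.2f}/month")
```

Under these assumed prices the Databricks bill is roughly 2x the EMR bill for the same EC2 fleet, in line with the 2-3x range above.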
Cluster Node Types
Master node: coordinates the cluster. Runs the YARN ResourceManager and HDFS NameNode. One per cluster. If this node fails, the cluster fails, so use multi-master configurations for production.
Core nodes: run the YARN NodeManager and store HDFS data. Adding or removing core nodes mid-job is risky because HDFS blocks must be rebalanced. Use them for stable baseline capacity.
Task nodes: run the YARN NodeManager only, with no HDFS storage. Safe to add or remove at any time. Use Spot instances here for roughly 70% cost savings.
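Because task nodes are the safe ones to grow and shrink, they are the natural target for programmatic resizing. A hedged sketch with boto3 follows; the cluster ID is a placeholder, and the request payload is built separately so it can be inspected without calling AWS:

```python
# Sketch: add a Spot-backed task instance group to a running cluster.
# 'j-XXXXXXXXXXXXX' is a placeholder cluster ID.

spot_task_group = {
    "Name": "spot-task-workers",
    "InstanceRole": "TASK",       # TASK = compute only, no HDFS, safe to interrupt
    "InstanceType": "m5.2xlarge",
    "InstanceCount": 4,
    "Market": "SPOT",             # Spot pricing for the savings noted above
}

def add_spot_task_nodes(cluster_id: str) -> list:
    import boto3  # imported here so the payload above can be read without AWS

    emr = boto3.client("emr", region_name="us-east-1")
    response = emr.add_instance_groups(
        JobFlowId=cluster_id,
        InstanceGroups=[spot_task_group],
    )
    return response["InstanceGroupIds"]

# Example (requires a live cluster):
# add_spot_task_nodes("j-XXXXXXXXXXXXX")
```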
Launching a Cluster with a Spark Job
# Launch an EMR cluster using the AWS CLI.
# EMR runs Apache Spark (and Hadoop, Hive, Presto) on EC2.
# Notes on the flags below (comments cannot follow the line-continuation
# backslashes, so they live up here):
#   --release-label emr-7.0.0   EMR version; includes Spark 3.5
#   TASK instance group         the elastic workers
#   --use-default-roles         EMR_EC2_DefaultRole + EMR_DefaultRole
#   --auto-terminate            cluster shuts down after all steps complete
aws emr create-cluster \
--name "DataEngineering-Production" \
--release-label emr-7.0.0 \
--applications Name=Spark Name=Hadoop \
--instance-groups \
InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1 \
InstanceGroupType=CORE,InstanceType=m5.2xlarge,InstanceCount=2 \
InstanceGroupType=TASK,InstanceType=m5.2xlarge,InstanceCount=4 \
--use-default-roles \
--ec2-attributes KeyName=your-key-pair \
--log-uri s3://your-bucket/emr-logs/ \
--auto-terminate \
--steps \
Type=Spark,Name="Transform Sales Data",\
Args=[--deploy-mode,cluster,\
--class,com.yourcompany.SalesTransform,\
s3://your-bucket/jars/pipeline.jar,\
--input,s3://your-bucket/bronze/sales/,\
--output,s3://your-bucket/silver/sales/]
PySpark Job for EMR
EMR PySpark code is identical to Databricks PySpark — same API, same functions. The only difference is how you configure the SparkSession and how you submit the job.
# PySpark job submitted to EMR
# Save this as transform.py and upload to S3
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
import sys

def main():
    spark = SparkSession.builder \
        .appName("SalesTransform") \
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
        .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
        .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
        .config("spark.sql.catalog.glue_catalog.warehouse", "s3://your-bucket/iceberg/") \
        .getOrCreate()

    input_path = sys.argv[1]   # s3://your-bucket/bronze/sales/
    output_path = sys.argv[2]  # s3://your-bucket/silver/sales/

    # Read bronze data from S3
    df = spark.read \
        .option("header", True) \
        .option("inferSchema", True) \
        .csv(input_path)

    # Transform
    df_clean = df \
        .filter(F.col("order_id").isNotNull()) \
        .dropDuplicates(["order_id"]) \
        .withColumn("revenue", F.col("revenue").cast(DoubleType())) \
        .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd")) \
        .withColumn("year", F.year("order_date")) \
        .withColumn("month", F.month("order_date")) \
        .withColumn("day", F.dayofmonth("order_date"))

    # Write to S3 as Parquet with partitioning
    df_clean.write \
        .format("parquet") \
        .mode("overwrite") \
        .partitionBy("year", "month", "day") \
        .save(output_path)

    print(f"Written {df_clean.count()} rows to {output_path}")
    spark.stop()

if __name__ == "__main__":
    main()
Submitting Steps to a Running Cluster
# Submit a Spark step to a running EMR cluster
# EMR Steps = individual jobs submitted to the cluster queue
import boto3

emr = boto3.client('emr', region_name='us-east-1')

response = emr.add_job_flow_steps(
    JobFlowId='j-YOURCLUSTERID',
    Steps=[
        {
            'Name': 'Transform Sales Data',
            'ActionOnFailure': 'CONTINUE',  # or TERMINATE_CLUSTER
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': [
                    'spark-submit',
                    '--deploy-mode', 'cluster',
                    '--master', 'yarn',
                    '--executor-memory', '8G',
                    '--executor-cores', '4',
                    '--num-executors', '10',
                    's3://your-bucket/scripts/transform.py',
                    's3://your-bucket/bronze/sales/2025/03/15/',
                    's3://your-bucket/silver/sales/'
                ]
            }
        }
    ]
)

print(f"Step submitted: {response['StepIds']}")
EMR vs Databricks
| Aspect | EMR | Databricks |
|---|---|---|
| Cost | Lower — EC2 + EMR fee only | Higher — Databricks Units (DBUs) on top of EC2 |
| Setup | More configuration required | Managed, minimal configuration |
| Notebooks | EMR Studio (limited) | Full collaborative notebook environment |
| Delta Lake | Supported but not default | Native, first-class support |
| Auto-scaling | Managed scaling available | Built-in, seamless |
| ML support | Manual setup required | MLflow built in |
| Best for | Cost-sensitive, ops-heavy teams | Productivity-focused teams |
🎯 Key Takeaways
- ✓ EMR runs Apache Spark on EC2 — same Spark code as Databricks, lower cost
- ✓ Three node types: Master (coordinator), Core (storage + compute), Task (compute only)
- ✓ Run Task nodes on Spot instances for 60-70% cost savings — they are safe to interrupt
- ✓ EMR Steps are individual Spark jobs submitted to the cluster queue
- ✓ EMR Serverless removes cluster management — submit jobs without provisioning
- ✓ Choose EMR over Databricks when cost matters more than developer productivity
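The EMR Serverless takeaway can be sketched as well: jobs run against a pre-created Serverless application instead of a cluster. The application ID, role ARN, and S3 paths below are placeholders, and the call is wrapped in a function so nothing runs without AWS credentials:

```python
# Sketch: run the same transform.py on EMR Serverless (no cluster to manage).
# All identifiers below are placeholders.

job_driver = {
    "sparkSubmit": {
        "entryPoint": "s3://your-bucket/scripts/transform.py",
        "entryPointArguments": [
            "s3://your-bucket/bronze/sales/",
            "s3://your-bucket/silver/sales/",
        ],
        "sparkSubmitParameters": "--conf spark.executor.memory=8G "
                                 "--conf spark.executor.cores=4",
    }
}

def run_serverless_job(application_id: str, role_arn: str) -> str:
    import boto3  # imported here so the driver spec can be read without AWS

    client = boto3.client("emr-serverless", region_name="us-east-1")
    response = client.start_job_run(
        applicationId=application_id,
        executionRoleArn=role_arn,
        jobDriver=job_driver,
    )
    return response["jobRunId"]

# Example (requires an existing EMR Serverless application):
# run_serverless_job("00placeholder", "arn:aws:iam::123456789012:role/EMRServerlessRole")
```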