
Amazon EMR

Amazon EMR (Elastic MapReduce) is AWS's managed big data platform. It runs Apache Spark, Hadoop, Hive, and Presto on EC2 clusters. EMR is a lower-cost alternative to Databricks for teams that want Spark without the premium.

13 min read · March 2026

What is Amazon EMR?

EMR provisions and manages clusters of EC2 instances running big data frameworks — primarily Apache Spark. You define the cluster size, submit Spark jobs, and EMR handles provisioning, configuration, monitoring, and termination.

The core value of EMR over Databricks is cost. EMR charges the EC2 instance price plus a small EMR management fee. Databricks charges EC2 plus Databricks Unit (DBU) fees on top — often 2-3x more expensive for equivalent compute. For teams running large, long-running batch jobs, EMR saves significant money.
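To make that cost gap concrete, here is a rough back-of-the-envelope comparison. All rates below are hypothetical placeholders, not current AWS or Databricks pricing — check the published rates for your region and instance type.

```python
# Rough monthly cost comparison for one worker instance running 24/7.
# Every rate here is a hypothetical placeholder, not real pricing.
HOURS_PER_MONTH = 730

ec2_rate = 0.384   # $/hr for the EC2 instance itself (hypothetical)
emr_fee  = 0.096   # $/hr EMR management surcharge (hypothetical)
dbu_cost = 0.600   # $/hr equivalent DBU charge on Databricks (hypothetical)

emr_monthly        = (ec2_rate + emr_fee) * HOURS_PER_MONTH
databricks_monthly = (ec2_rate + dbu_cost) * HOURS_PER_MONTH

print(f"EMR:        ${emr_monthly:,.2f}/month")
print(f"Databricks: ${databricks_monthly:,.2f}/month")
print(f"Premium:    {databricks_monthly / emr_monthly:.2f}x")
```

The multiplier scales with cluster size, which is why the gap matters most for large, long-running batch workloads.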

EMR on EC2 vs EMR Serverless
EMR on EC2 gives you full cluster control — instance types, Spark configs, cluster lifetime. EMR Serverless is the newer, managed option — you submit jobs without provisioning a cluster. EMR Serverless is simpler but less configurable. Start with EMR on EC2 to learn the concepts.
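For contrast, an EMR Serverless submission looks roughly like this. The sketch below only builds the request payload; the application ID, role ARN, and S3 paths are placeholders for resources you would create first, and the actual call (commented out) goes through the boto3 `emr-serverless` client's `start_job_run`.

```python
def build_serverless_job(application_id: str, role_arn: str,
                         script_uri: str, args: list) -> dict:
    # Request shape for emr-serverless start_job_run — no cluster to provision
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,
                "entryPointArguments": args,
            }
        },
    }

job = build_serverless_job(
    "00f1abcd2345",                                   # placeholder application ID
    "arn:aws:iam::123456789012:role/emr-serverless",  # placeholder role ARN
    "s3://your-bucket/scripts/transform.py",
    ["s3://your-bucket/bronze/sales/", "s3://your-bucket/silver/sales/"],
)
# emr_serverless = boto3.client("emr-serverless")
# response = emr_serverless.start_job_run(**job)
```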

Cluster Node Types

👑
Master Node

Coordinates the cluster. Runs the YARN ResourceManager and HDFS NameNode. One per cluster. If it fails, the cluster fails — launch with multiple master nodes for production high availability.

💾
Core Node

Runs YARN NodeManager and stores HDFS data. Adding/removing core nodes is risky during jobs. Use for stable baseline capacity.

⚙️
Task Node

Runs YARN NodeManager only — no HDFS storage. Safe to add/remove anytime. Use Spot instances here for 60-70% cost savings.
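The three node types map directly onto the instance-group definitions you pass to the EMR API. A sketch of the equivalent groups for boto3's `run_job_flow` (instance types and counts are examples), with Task nodes on Spot and the other two on On-Demand:

```python
def build_instance_groups() -> list:
    # Instance groups for boto3 emr.run_job_flow(Instances={"InstanceGroups": ...})
    return [
        {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
         "InstanceCount": 1, "Market": "ON_DEMAND"},  # coordinator — keep stable
        {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m5.2xlarge",
         "InstanceCount": 2, "Market": "ON_DEMAND"},  # stores HDFS — keep stable
        {"Name": "Task", "InstanceRole": "TASK", "InstanceType": "m5.2xlarge",
         "InstanceCount": 4, "Market": "SPOT"},       # no HDFS — safe to interrupt
    ]

groups = build_instance_groups()
# emr = boto3.client("emr")
# emr.run_job_flow(Name="...", Instances={"InstanceGroups": groups, ...}, ...)
```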

Launching a Cluster with a Spark Job

launch_emr_cluster.sh
bash
# Launch an EMR cluster using the AWS CLI
# EMR runs Apache Spark (and Hadoop, Hive, Presto) on EC2
#
# Notes:
# - emr-7.0.0 includes Spark 3.5
# - --use-default-roles expects EMR_EC2_DefaultRole + EMR_DefaultRole to exist
# - --auto-terminate shuts the cluster down after all steps complete
# (inline comments after a trailing backslash would break line continuation,
#  so they live up here instead)

aws emr create-cluster \
  --name "DataEngineering-Production" \
  --release-label emr-7.0.0 \
  --applications Name=Spark Name=Hadoop \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,InstanceType=m5.2xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,InstanceType=m5.2xlarge,InstanceCount=4 \
  --use-default-roles \
  --ec2-attributes KeyName=your-key-pair \
  --log-uri s3://your-bucket/emr-logs/ \
  --auto-terminate \
  --steps Type=Spark,Name="Transform Sales Data",Args=[--deploy-mode,cluster,--class,com.yourcompany.SalesTransform,s3://your-bucket/jars/pipeline.jar,--input,s3://your-bucket/bronze/sales/,--output,s3://your-bucket/silver/sales/]
Use Spot instances on Task nodes
Task nodes do not store HDFS data — they are safe to terminate mid-job (Spark will retry failed tasks). Run Task nodes on Spot instances for 60-70% cost savings. Run Master and Core nodes on On-Demand for stability.

PySpark Job for EMR

EMR PySpark code is identical to Databricks PySpark — same API, same functions. The only difference is how you configure the SparkSession and how you submit the job.

transform.py
python
# PySpark job submitted to EMR
# Save this as transform.py and upload to S3

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
import sys

def main():
    # The Iceberg/Glue catalog configs below are optional — only needed if you
    # write Iceberg tables registered in the AWS Glue Data Catalog
    spark = SparkSession.builder \
        .appName("SalesTransform") \
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
        .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
        .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
        .config("spark.sql.catalog.glue_catalog.warehouse", "s3://your-bucket/iceberg/") \
        .getOrCreate()

    input_path  = sys.argv[1]   # s3://your-bucket/bronze/sales/
    output_path = sys.argv[2]   # s3://your-bucket/silver/sales/

    # Read bronze data from S3
    df = spark.read \
        .option("header", True) \
        .option("inferSchema", True) \
        .csv(input_path)

    # Transform
    df_clean = df \
        .filter(F.col("order_id").isNotNull()) \
        .dropDuplicates(["order_id"]) \
        .withColumn("revenue",    F.col("revenue").cast(DoubleType())) \
        .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd")) \
        .withColumn("year",       F.year("order_date")) \
        .withColumn("month",      F.month("order_date")) \
        .withColumn("day",        F.dayofmonth("order_date"))

    # Write to S3 as Parquet with partitioning
    df_clean.write \
        .format("parquet") \
        .mode("overwrite") \
        .partitionBy("year", "month", "day") \
        .save(output_path)

    # Note: count() triggers a second full computation — fine for a job summary
    print(f"Wrote {df_clean.count()} rows to {output_path}")
    spark.stop()

if __name__ == "__main__":
    main()
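The partitionBy("year", "month", "day") call above produces a Hive-style directory layout under the output path. A quick sketch of the key structure it generates (partition values are written unpadded, as Spark emits them for integer columns):

```python
from datetime import date

def partition_prefix(d: date) -> str:
    # Hive-style partition path, mirroring partitionBy("year", "month", "day")
    return f"year={d.year}/month={d.month}/day={d.day}"

print(partition_prefix(date(2025, 3, 15)))
# → year=2025/month=3/day=15
```

This layout is what lets downstream readers (Athena, Spark, Glue) prune partitions when you filter on year, month, or day.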

Submitting Steps to a Running Cluster

submit_step.py
python
# Submit a Spark step to a running EMR cluster
# EMR Steps = individual jobs submitted to the cluster queue

import boto3

emr = boto3.client('emr', region_name='us-east-1')

response = emr.add_job_flow_steps(
    JobFlowId='j-YOURCLUSTERID',
    Steps=[
        {
            'Name': 'Transform Sales Data',
            'ActionOnFailure': 'CONTINUE',   # or TERMINATE_CLUSTER
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': [
                    'spark-submit',
                    '--deploy-mode', 'cluster',
                    '--master', 'yarn',
                    '--executor-memory', '8G',
                    '--executor-cores', '4',
                    '--num-executors', '10',
                    's3://your-bucket/scripts/transform.py',
                    's3://your-bucket/bronze/sales/2025/03/15/',
                    's3://your-bucket/silver/sales/'
                ]
            }
        }
    ]
)
print(f"Step submitted: {response['StepIds']}")
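add_job_flow_steps returns immediately — the step runs asynchronously on the cluster. A polling sketch using EMR's describe_step API (the client is passed in as a parameter so it can be stubbed in tests; in production you'd pass the boto3 client from the example above):

```python
import time

def wait_for_step(emr_client, cluster_id: str, step_id: str,
                  poll_seconds: float = 30) -> str:
    # Poll describe_step until the step reaches a terminal state
    terminal = {"COMPLETED", "FAILED", "CANCELLED", "INTERRUPTED"}
    while True:
        state = emr_client.describe_step(
            ClusterId=cluster_id, StepId=step_id
        )["Step"]["Status"]["State"]
        if state in terminal:
            return state
        time.sleep(poll_seconds)

# final_state = wait_for_step(emr, 'j-YOURCLUSTERID', response['StepIds'][0])
```

AWS also ships built-in waiters (for example, `emr.get_waiter('step_complete')`) that do the same thing with backoff handled for you.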

EMR vs Databricks

Aspect | EMR | Databricks
Cost | Lower — EC2 + EMR fee only | Higher — Databricks Units (DBUs) on top of EC2
Setup | More configuration required | Managed, minimal configuration
Notebooks | EMR Studio (limited) | Full collaborative notebook environment
Delta Lake | Supported but not default | Native, first-class support
Auto-scaling | Managed scaling available | Built-in, seamless
ML support | Manual setup required | MLflow built in
Best for | Cost-sensitive, ops-heavy teams | Productivity-focused teams

🎯 Key Takeaways

  • EMR runs Apache Spark on EC2 — same Spark code as Databricks, lower cost
  • Three node types: Master (coordinator), Core (storage + compute), Task (compute only)
  • Run Task nodes on Spot instances for 60-70% cost savings — they are safe to interrupt
  • EMR Steps are individual Spark jobs submitted to the cluster queue
  • EMR Serverless removes cluster management — submit jobs without provisioning
  • Choose EMR over Databricks when cost matters more than developer productivity