Apache Spark Architecture Explained — How Spark Actually Works
Every data engineer uses Spark, but most cannot explain how it works under the hood. Understanding the architecture — drivers, executors, DAGs, stages — is what separates engineers who can debug slow jobs from those who just restart the cluster and hope.
Driver and Executors
Spark runs on a master-worker architecture. The Driver is the brain — it runs your main program, builds the execution plan, and coordinates everything. Executors are the workers — JVM processes running on worker nodes that do the actual computation.
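This split is visible at submit time: you size the Driver and the Executors before the job starts. A hypothetical spark-submit invocation (the file name and resource numbers are illustrative, not a recommendation):

```shell
# Driver gets 4g; 10 executors with 4 cores and 8g each
# => up to 40 tasks can run in parallel across the cluster
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  my_job.py
```

With --deploy-mode cluster the Driver itself runs on a worker node rather than on the machine you submitted from, which matters for long-running jobs.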
When you call spark.read.csv("...") and chain transformations like filter or select, nothing executes yet. The Driver records each transformation lazily. Only when you call an action such as count() or write() does the Driver compile the recorded transformations into a DAG, split it into stages, and send tasks to the Executors.
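This record-then-execute pattern can be sketched in plain Python (a toy model, not Spark's real internals; the name LazyFrame is invented for illustration):

```python
class LazyFrame:
    """Records transformations; runs nothing until an action is called."""

    def __init__(self, rows, plan=None):
        self.rows = rows
        self.plan = plan or []  # recorded, not-yet-executed transformations

    def filter(self, predicate):
        # Transformation: append to the plan, touch no data
        return LazyFrame(self.rows, self.plan + [("filter", predicate)])

    def count(self):
        # Action: replay the whole plan now, then compute the result
        data = self.rows
        for op, fn in self.plan:
            if op == "filter":
                data = [r for r in data if fn(r)]
        return len(data)

df = LazyFrame([1, 2, 3, 4, 5])
pending = df.filter(lambda r: r % 2 == 0)  # nothing computed yet
print(pending.count())                     # plan executes here -> 2
```

Deferring execution is what gives the optimizer room to work: it sees the whole plan before any data moves.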
DAG — Directed Acyclic Graph
Every Spark job becomes a DAG of stages. A stage is a set of tasks that can be computed without shuffling data across the network.
When Spark hits a wide transformation (groupBy, join, orderBy), it must shuffle data — all rows with the same key must end up on the same partition. This creates a stage boundary. Understanding stage boundaries tells you exactly where your job is spending time.
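The shuffle's routing rule can be sketched in plain Python: each row's target partition is derived from its key, so equal keys always land together (a simplified model of hash partitioning, not Spark's actual shuffle machinery):

```python
def shuffle_by_key(rows, num_partitions):
    """Route each (key, value) row to partition hash(key) % num_partitions.
    Equal keys hash alike, so they end up on the same partition --
    exactly what groupBy/join need to see all rows for a key in one place."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
parts = shuffle_by_key(rows, num_partitions=4)
# All "a" rows share one partition, all "b" rows another
```

In the real thing, every partition of the upstream stage writes shuffle files and every partition of the downstream stage fetches its slice over the network, which is why this is the expensive part.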
How partitions map to tasks
Each partition becomes one task. If a stage reads a DataFrame with 200 partitions, Spark creates 200 tasks for that stage. Each Executor runs tasks in parallel, one per available core.
If you have 10 Executors with 4 cores each, Spark runs 40 tasks in parallel. With 200 partitions that's 5 waves of computation. Tuning partition count directly controls parallelism.
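The wave arithmetic above reduces to one formula:

```python
import math

def waves(num_partitions, num_executors, cores_per_executor):
    """Number of sequential task waves: tasks divided by parallel task slots."""
    parallel_slots = num_executors * cores_per_executor
    return math.ceil(num_partitions / parallel_slots)

print(waves(200, 10, 4))  # 200 tasks / 40 slots -> 5 waves
print(waves(210, 10, 4))  # 210 / 40 -> 6 waves; the last wave runs only 10 tasks
```

The second case shows why slightly-off partition counts hurt: a mostly-idle final wave still takes a full wave of wall-clock time.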
The Catalyst optimizer
Before running anything, Spark's Catalyst optimizer rewrites your query plan. It pushes filters down (reads less data), reorders joins (smaller tables first), and combines projections.
This is why df.filter(...).select(...) and df.select(...).filter(...) typically compile to the same physical plan (as long as the select keeps the columns the filter needs). Catalyst normalizes both into the same optimized execution order. Use df.explain(True) to see the plan Spark actually runs.
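The equivalence is easy to check by hand with the two query shapes written out in plain Python (a stand-in for the logical plans Catalyst rewrites, assuming the projection keeps the filter column):

```python
rows = [{"name": "ada", "age": 36}, {"name": "bob", "age": 17}]

# Shape 1: filter first, then project down to one column
q1 = [{"name": r["name"]} for r in rows if r["age"] >= 18]

# Shape 2: project (keeping the filter column), filter, then drop it
projected = [{"name": r["name"], "age": r["age"]} for r in rows]
q2 = [{"name": r["name"]} for r in projected if r["age"] >= 18]

assert q1 == q2 == [{"name": "ada"}]
```

Since both shapes denote the same result, Catalyst is free to pick whichever order reads the least data — usually filtering as early as possible.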
What this means for your code
Writing good Spark code means helping the optimizer: filter early, avoid Python UDFs (they bypass Catalyst and force row-by-row serialization between the JVM and Python), partition on join keys, and cache DataFrames you reuse.
Bad Spark code is not wrong — it produces correct results. It just wastes cluster time and money. Understanding the architecture means you can look at a slow job and immediately know whether the bottleneck is shuffles, data skew, small files, or missing filters.