Interview · Apache Spark

15 PySpark Interview Questions Asked at Real Data Engineering Roles

March 5, 2026 · 10 min read · ✍️ by Asil

These are real PySpark questions asked at consulting firms, financial services companies, and technology companies that sponsor H1B visas. For each question, I have included the answer interviewers actually want to hear — not the textbook definition.

Fundamentals (expect all of these)

Q1: What is the difference between a transformation and an action in Spark?

A: Transformations (filter, select, groupBy, join) are lazy — they define a computation plan but do not execute. Actions (count, collect, show, write) trigger actual execution. This lazy evaluation allows Spark to optimize the full execution plan before running anything.

Q2: What is a DataFrame vs an RDD?

A: RDD is the low-level Spark API — distributed collection of objects with no schema. DataFrame is the higher-level API with a schema, similar to a SQL table. In 99% of production code you use DataFrames. RDDs are rarely written directly anymore.

Q3: What is a partition in Spark?

A: A partition is a chunk of the data distributed across worker nodes. Spark processes each partition in parallel. Too few partitions = underutilized cluster. Too many = excessive overhead. Rule of thumb: aim for 128-256MB per partition.

Q4: Explain narrow vs wide transformations.

A: Narrow transformations (filter, select, map) process each partition independently — no data movement between nodes. Wide transformations (groupBy, join, orderBy) require shuffling data across the network, which is expensive. Minimize wide transformations to optimize Spark jobs.

Performance questions (asked at most senior roles)

Q5: Your Spark job is slow. How do you debug it?

A: Start with the Spark UI — look at the Stages tab for skewed stages (one task taking much longer than others). Check for data skew, excessive shuffles, or small file problems. Use df.explain() to see the physical plan and spot unnecessary shuffles.

Q6: What causes data skew and how do you fix it?

A: Skew happens when one partition holds far more data than the others — common when joining on a column with a dominant value (like NULL or a single customer ID with millions of rows). Fix with salting: add a random suffix to the skewed key on the large side, replicate the small side once per salt value so every salted key still finds its match, join on the salted key, then drop the salt and aggregate.

Q7: What is broadcast join and when do you use it?

A: When joining a large table with a small one, a broadcast join ships the small table to every worker node, avoiding an expensive shuffle. Spark does this automatically when the small side is below spark.sql.autoBroadcastJoinThreshold (10MB by default); force it explicitly with F.broadcast(small_df).

Delta Lake questions (common at Azure and Databricks shops)

Q8: What is Delta Lake and why is it used over plain Parquet?

A: Delta Lake adds ACID transactions, schema enforcement, time travel, and efficient upserts (MERGE) on top of Parquet files. Plain Parquet has no transaction support — concurrent writes can corrupt data. Delta prevents this.

Q9: How does the MERGE statement work in Delta Lake?

A: MERGE matches rows between source and target on a key, then applies conditional logic: update if matched, insert if not matched, delete if matched and a condition is true. This enables efficient upserts for slowly changing dimensions and incremental loads.
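A sketch of the shape interviewers expect — the table and column names here (dim_customers, updates_stg) are hypothetical:

```sql
MERGE INTO dim_customers AS t
USING updates_stg AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.is_deleted = true THEN DELETE
WHEN MATCHED THEN UPDATE SET t.email = s.email, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, email, updated_at)
  VALUES (s.customer_id, s.email, s.updated_at)
```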

Q10: What is Z-ordering in Delta Lake?

A: Z-ordering co-locates related data in the same files, so queries with filters on Z-ordered columns skip more files. Use OPTIMIZE table ZORDER BY (column) on columns you frequently filter on.

Code questions (you will write PySpark live)

Q11: Read a CSV, remove nulls in order_id, remove duplicates, write as Delta.

from pyspark.sql import functions as F

# Read the CSV with a header row
df = spark.read.option("header", True).csv("path/to/file.csv")

# Drop rows with a null order_id, then deduplicate on order_id
df = df.filter(F.col("order_id").isNotNull())
df = df.dropDuplicates(["order_id"])

# Write out as a Delta table
df.write.format("delta").mode("overwrite").save("path/to/output")

Q12: Calculate 7-day rolling average revenue per customer.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Current row plus the 6 preceding rows per customer.
# Note: this is row-based, so it assumes one row per day;
# for a true calendar window, use rangeBetween over a day-based column.
window = Window.partitionBy("customer_id").orderBy("order_date").rowsBetween(-6, 0)

df = df.withColumn("rolling_7d_avg", F.avg("revenue").over(window))

Architecture and design questions

Q13: How do you handle late-arriving data in a streaming pipeline?

A: Use watermarking in Spark Structured Streaming. Define a watermark on the event-time column — events arriving later than the watermark threshold may be dropped, and Spark can discard state for windows that can no longer receive updates. This bounds state size and memory usage while tolerating a reasonable amount of lateness.

Q14: What is the difference between cache() and persist()?

A: cache() is shorthand for persist() with the default storage level — MEMORY_ONLY for RDDs, but MEMORY_AND_DISK for DataFrames (a common interview gotcha). persist() lets you choose the level explicitly: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and serialized variants. Use MEMORY_AND_DISK when the DataFrame might not fit entirely in memory.

Q15: Your pipeline processes 50GB daily. It takes 4 hours. How do you speed it up?

A: First, profile it — identify the bottleneck. Common fixes: increase cluster size, repartition before heavy transforms (df.repartition(200)), replace Python UDFs with native Spark functions (10-100x faster), push filters as early as possible, use Delta Lake Z-ordering on filter columns.