Azure Databricks
Azure Databricks is where the real transformation work happens. It brings Apache Spark as a fully managed service — write PySpark code in notebooks, run it across a cluster of machines, and process datasets too large to fit in memory on any single computer.
What Databricks actually gives you
Azure Databricks is Apache Spark — the industry-standard distributed computing engine — running as a fully managed Azure service. You don't install Spark, you don't configure clusters from scratch, you don't manage Java dependencies. You open a notebook, write PySpark code, and Databricks handles running it across however many machines the job needs.
For data engineers, Databricks is primarily used for one thing: transforming large datasets. You read raw data from the Bronze layer, apply business logic, clean and validate it, then write clean data to Silver and Gold layers in Delta Lake format. This is the core data engineering workflow on Azure.
Clusters — what they are and how to configure them
A cluster is the group of virtual machines that runs your Spark code. You create a cluster in the Databricks UI, write notebooks that run on it, and Databricks distributes the work across all the machines automatically.
# Databricks cluster configuration (set in the UI or via API)
# For production data engineering workloads:
Cluster Mode: Standard (single user) or Shared (multi-user)
Databricks Runtime: 14.3 LTS (Long Term Support — use LTS in production)
Worker Type: Standard_DS3_v2 (14GB RAM, 4 cores per worker)
Min Workers: 2
Max Workers: 8 (auto-scales based on workload)
Auto-terminate: 60 minutes of inactivity
# For development/exploration — use a smaller cluster:
Worker Type: Standard_DS3_v2
Workers: 1 (fixed, no auto-scale needed for dev)
Auto-terminate: 20 minutes

All-Purpose Clusters: always running, interactive, used for writing and testing notebooks. More expensive because they stay on when idle. Use for development only.
Job Clusters: spin up fresh for each job, run to completion, then terminate automatically. Cheaper than All-Purpose. Always use these for ADF-triggered production jobs.
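The same job-cluster settings can be supplied programmatically when ADF (or any orchestrator) submits a run. A minimal sketch using the Databricks Jobs API 2.1 `runs/submit` payload shape — the workspace URL and token are placeholders, and the notebook path matches the example later in this lesson:

```python
# Sketch: running a notebook on an ephemeral job cluster via the
# Databricks Jobs API 2.1 (runs/submit). Workspace URL and token below
# are placeholders — substitute your own values.
import json

new_cluster = {
    "spark_version": "14.3.x-scala2.12",   # 14.3 LTS runtime
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

run_payload = {
    "run_name": "bronze_to_silver_daily",
    "tasks": [{
        "task_key": "bronze_to_silver",
        "new_cluster": new_cluster,        # ephemeral: created for this run only
        "notebook_task": {
            "notebook_path": "/Shared/bronze_to_silver",
            "base_parameters": {"run_date": "2025-01-14"},
        },
    }],
}

# To actually submit (requires the `requests` package and a valid PAT):
# import requests
# resp = requests.post(
#     "https://<your-workspace>.azuredatabricks.net/api/2.1/jobs/runs/submit",
#     headers={"Authorization": "Bearer <token>"},
#     json=run_payload,
# )
print(json.dumps(run_payload, indent=2))
```

Because the cluster is defined inside the run payload, it terminates automatically when the run finishes — the cost behavior described above for Job Clusters.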
Bronze → Silver notebook
This is the most common Databricks notebook you'll write. It reads raw data from Bronze, applies data quality rules, transforms column types, adds audit columns, and writes clean data to Silver as a Delta table. Notice the ADF parameter at the top — ADF passes the run_date when it triggers this notebook.
# Databricks Notebook: Bronze → Silver transformation
# File: /Shared/bronze_to_silver.py
import sys
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime
# Get run_date parameter passed from ADF
dbutils.widgets.text("run_date", "2025-01-01")
run_date = dbutils.widgets.get("run_date")
spark = SparkSession.builder.appName("BronzeToSilver").getOrCreate()
# ── 1. Read raw Bronze data ──────────────────────────────────────────
bronze_path = f"abfss://bronze@yourlake.dfs.core.windows.net/sales/{run_date}/"
df_raw = spark.read.option("header", True).option("inferSchema", True).csv(bronze_path)
print(f"✅ Read {df_raw.count()} rows from Bronze for {run_date}")
# ── 2. Data quality checks ───────────────────────────────────────────
# Remove rows with null primary keys
df = df_raw.filter(F.col("order_id").isNotNull())
# Remove duplicates
df = df.dropDuplicates(["order_id"])
# ── 3. Transformations ───────────────────────────────────────────────
df_clean = (df
.withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
.withColumn("unit_price", F.col("unit_price").cast(DoubleType()))
.withColumn("quantity", F.col("quantity").cast(IntegerType()))
.withColumn("discount", F.coalesce(F.col("discount").cast(DoubleType()), F.lit(0.0)))
.withColumn("total_amount", F.col("unit_price") * F.col("quantity"))
.withColumn("net_amount", F.col("total_amount") * (1 - F.col("discount")))
.withColumn("region", F.upper(F.trim(F.col("region"))))
.withColumn("_load_date", F.lit(run_date))
.withColumn("_load_timestamp", F.current_timestamp())
)
# ── 4. Write to Silver as Delta ──────────────────────────────────────
silver_path = "abfss://silver@yourlake.dfs.core.windows.net/sales/"
(df_clean.write
.format("delta")
.mode("append")
.partitionBy("order_date")
.save(silver_path)
)
print(f"✅ Wrote {df_clean.count()} clean rows to Silver Delta table")
# Return count for ADF to log
dbutils.notebook.exit(str(df_clean.count()))
Silver → Gold notebook
The Gold layer is where you aggregate data into business-ready summaries. This notebook creates two Gold tables from the Silver sales data — a daily sales summary and customer lifetime value. These are the tables that analysts query directly from Synapse Analytics.
# Databricks Notebook: Silver → Gold aggregation
# File: /Shared/silver_to_gold.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
spark = SparkSession.builder.appName("SilverToGold").getOrCreate()
# Read entire Silver table (Delta handles efficient reads)
df_silver = spark.read.format("delta").load(
"abfss://silver@yourlake.dfs.core.windows.net/sales/"
)
# ── Gold Table 1: Daily Sales Summary ───────────────────────────────
daily_summary = (df_silver
.groupBy("order_date", "region", "product_category")
.agg(
F.sum("net_amount").alias("total_revenue"),
F.count("order_id").alias("order_count"),
F.avg("net_amount").alias("avg_order_value"),
F.countDistinct("customer_id").alias("unique_customers"),
F.sum("quantity").alias("units_sold")
)
.withColumn("revenue_rank",
F.rank().over(Window.partitionBy("order_date").orderBy(F.desc("total_revenue")))
)
)
daily_summary.write.format("delta").mode("overwrite").save(
"abfss://gold@yourlake.dfs.core.windows.net/daily_sales_summary/"
)
# ── Gold Table 2: Customer Lifetime Value ───────────────────────────
customer_ltv = (df_silver
.groupBy("customer_id")
.agg(
F.sum("net_amount").alias("lifetime_value"),
F.count("order_id").alias("total_orders"),
F.min("order_date").alias("first_order_date"),
F.max("order_date").alias("last_order_date"),
F.avg("net_amount").alias("avg_order_value")
)
)
customer_ltv.write.format("delta").mode("overwrite").save(
"abfss://gold@yourlake.dfs.core.windows.net/customer_ltv/"
)
print("✅ Gold layer refreshed successfully")
Delta Lake — what makes Databricks special
Delta Lake is an open-source storage layer that adds reliability to your data lake. It stores data as Parquet files but wraps them with a transaction log that enables ACID transactions, schema enforcement, time travel, and efficient upserts. Every production Azure data pipeline uses Delta Lake — it is the standard.
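To make the transaction log concrete, here is a toy model (plain Python, not the real Delta implementation) of how numbered commits in `_delta_log/` support `versionAsOf`: each commit adds and removes data files, and reading at version N means replaying commits 0 through N. File names below are illustrative:

```python
# Illustration only: a Delta table directory holds Parquet data files plus
# a _delta_log/ folder of numbered JSON commits. "versionAsOf N" replays
# commits 0..N to determine which data files were live at that version.
commits = [
    {"version": 0, "add": ["part-000.parquet"], "remove": []},
    {"version": 1, "add": ["part-001.parquet"], "remove": []},
    # An overwrite/compaction both adds new files and removes old ones:
    {"version": 2, "add": ["part-002.parquet"], "remove": ["part-000.parquet"]},
]

def files_as_of(version):
    """Replay the log up to `version` and return the live data files."""
    live = set()
    for commit in commits:
        if commit["version"] > version:
            break
        live.update(commit["add"])
        live.difference_update(commit["remove"])
    return sorted(live)

print(files_as_of(1))  # ['part-000.parquet', 'part-001.parquet']
print(files_as_of(2))  # ['part-001.parquet', 'part-002.parquet']
```

Because old files are only marked removed in the log (not deleted until VACUUM), earlier versions remain readable — that is what makes time travel cheap.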
# Delta Lake — the most important Databricks feature for data engineers
# Time Travel: read data as it was at a previous point in time
df_yesterday = spark.read.format("delta").option("timestampAsOf", "2025-01-14").load(silver_path)
df_version5 = spark.read.format("delta").option("versionAsOf", 5).load(silver_path)
# UPSERT (MERGE) — update existing records, insert new ones
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, silver_path)
delta_table.alias("target").merge(
df_updates.alias("source"),
"target.order_id = source.order_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
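Conceptually, `whenMatchedUpdateAll().whenNotMatchedInsertAll()` is a keyed upsert: rows whose key matches are overwritten, rows with new keys are inserted. A plain-Python sketch of the semantics (not Delta itself), with illustrative data:

```python
# Conceptual sketch of MERGE semantics as a keyed upsert (plain Python).
# target: existing Silver rows keyed by order_id
target = {
    101: {"order_id": 101, "net_amount": 50.0},
    102: {"order_id": 102, "net_amount": 75.0},
}
# source: incoming updates
updates = [
    {"order_id": 102, "net_amount": 80.0},   # key matches -> update all columns
    {"order_id": 103, "net_amount": 20.0},   # key is new  -> insert the row
]

for row in updates:
    # target.order_id = source.order_id, then updateAll / insertAll
    target[row["order_id"]] = row

print(sorted(target))  # [101, 102, 103]
```

Delta performs this atomically via the transaction log, so readers never see a half-applied merge.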
# OPTIMIZE — compact small files for faster queries (run weekly)
# Note: ZORDER columns must not be partition columns — this table is
# already partitioned by order_date, so Z-order on other filter columns
spark.sql("OPTIMIZE delta.`abfss://silver@yourlake.dfs.core.windows.net/sales/` ZORDER BY (region, customer_id)")
# VACUUM — remove old file versions to save storage (keep 7 days)
spark.sql("VACUUM delta.`abfss://silver@yourlake.dfs.core.windows.net/sales/` RETAIN 168 HOURS")
🎯 Key Takeaways
- ✓ Databricks is managed Apache Spark on Azure — write PySpark notebooks, run them at scale without managing infrastructure
- ✓ Use Job Clusters for production ADF-triggered runs (cheaper, auto-terminates); All-Purpose clusters are for development only
- ✓ The Bronze→Silver notebook is the most common pattern: read raw, validate, transform, write Delta
- ✓ Always accept ADF parameters via dbutils.widgets — this makes notebooks reusable for any date range
- ✓ Delta Lake adds ACID transactions, time travel, and MERGE (upsert) on top of Parquet — use it for all production tables
- ✓ Run OPTIMIZE regularly to compact small files — this is critical for query performance at scale
- ✓ dbutils.notebook.exit() returns a value to ADF so the pipeline can log the result of each notebook run