
What is Data Engineering?

Before you touch a single Azure service or write a line of PySpark, you need to understand what you are actually building and why companies pay so much to hire people who can build it.

10 min read · March 2026

The simplest definition

A data engineer builds the pipes that move data from where it is created to where it is useful.

That is it. Everything else — the cloud platforms, the Spark clusters, the Medallion Architecture — is just how you build those pipes.

A concrete example

Imagine you work at a retail company. Every day, thousands of customers buy things. Each purchase creates a record in the sales database — product, quantity, price, store, timestamp.

The business wants to know: what are the top-selling products this week? Which stores are underperforming? What should we restock?

None of that happens automatically. Someone has to:

  • Extract the raw sales data from the database
  • Clean it (remove duplicates, fix bad data, handle missing values)
  • Aggregate it (sum by product, group by store, calculate week-over-week)
  • Load it somewhere analysts can query it — a data warehouse
  • Make sure it runs every day without breaking
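The five steps above can be sketched as one minimal daily batch job. This is plain Python with sqlite3, not a production pipeline; the `sales` table and column names are illustrative stand-ins for the retail example:

```python
import sqlite3
from collections import defaultdict

def run_daily_pipeline(conn):
    # Extract: pull the raw sales rows from the source database
    rows = conn.execute(
        "SELECT product, quantity, price, store FROM sales"
    ).fetchall()

    # Clean: drop exact duplicates and rows with missing values
    seen, clean = set(), []
    for row in rows:
        if row in seen or any(v is None for v in row):
            continue
        seen.add(row)
        clean.append(row)

    # Aggregate: total revenue per product
    totals = defaultdict(float)
    for product, qty, price, store in clean:
        totals[product] += qty * price

    # Load: write results somewhere analysts can query them
    conn.execute(
        "CREATE TABLE IF NOT EXISTS product_revenue (product TEXT, revenue REAL)"
    )
    conn.execute("DELETE FROM product_revenue")
    conn.executemany("INSERT INTO product_revenue VALUES (?, ?)", totals.items())
    conn.commit()
    return dict(totals)
```

In a real job the same shape holds, just with bigger tools: PySpark instead of a loop, a warehouse instead of sqlite, and a scheduler instead of you calling the function by hand.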

That is what a data engineer does. You build the system that makes this happen automatically, reliably, every single day.

What a data analyst does vs what you do

Data analysts ask questions and find answers. They open a dashboard or write SQL queries to explore data.

Data engineers make sure that data exists in a clean, reliable, queryable form. You build the infrastructure analysts depend on.

If an analyst is asking "why are sales down in Region 3?", you are the person who made sure the sales data was there in the first place.

Both jobs need SQL. But your SQL is for building pipelines. Their SQL is for answering questions.

Batch vs streaming — the two main approaches

Almost every pipeline you build will fall into one of two categories.

Batch processing — collect data over a period of time, then process it all at once. Run at 2am, process yesterday's data, write results, done. Most pipelines in most companies are batch. Simpler, cheaper, easier to debug.

Streaming — process data the moment it arrives, event by event, with latency measured in milliseconds. Used when the business genuinely cannot wait — fraud detection on a credit card transaction, real-time inventory during a flash sale.

Start with batch. Understand it deeply. Most entry-level DE jobs are batch pipelines.
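One way to feel the difference is a toy sketch. The event format below is hypothetical; real streaming runs on engines like Spark Structured Streaming or a Kafka consumer, but the shape of the two approaches is the same:

```python
from datetime import date

def batch_process(events, day):
    """Batch: run once (e.g. at 2am) over everything collected for one day."""
    return sum(e["amount"] for e in events if e["day"] == day)

def stream_process(event, running_totals):
    """Streaming: update state the moment a single event arrives."""
    running_totals[event["day"]] = (
        running_totals.get(event["day"], 0.0) + event["amount"]
    )
    return running_totals

events = [
    {"day": date(2026, 3, 1), "amount": 10.0},
    {"day": date(2026, 3, 1), "amount": 5.0},
    {"day": date(2026, 3, 2), "amount": 7.0},
]
```

Both paths end at the same totals. The difference is when the work happens and how much live state you have to keep correct, which is exactly why batch is simpler, cheaper, and easier to debug.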

The Medallion Architecture — your main design pattern

Every employer will ask about this. It is the most common lakehouse design pattern, and it is simple once you understand the idea.

Raw data from the source lands in a Bronze layer. Exactly as-is. No changes. This is your backup — if anything goes wrong downstream, you reprocess from Bronze.

Bronze gets cleaned and validated in a Silver layer. Nulls removed. Duplicates dropped. Columns typed correctly. This is where data quality happens.

Silver gets aggregated and shaped into Gold for analysts. Daily sales totals. Customer lifetime value. Regional rankings. Gold tables are what dashboards and reports connect to.

Medallion Architecture
Raw Sales CSV → Bronze (ADLS Gen2)
                  ↓
          ADF triggers at 2am
                  ↓
     Databricks cleans → Silver (Delta Lake)
                  ↓
  Databricks aggregates → Gold (Delta Lake)
                  ↓
      Synapse / Power BI queries Gold
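The three layers in the diagram can be sketched in plain Python. In a real project each layer would be a Delta table in ADLS Gen2 and the transforms would be PySpark jobs; the records here are illustrative:

```python
# Bronze: raw records exactly as they landed, warts and all
bronze = [
    {"product": "apple", "store": "s1", "qty": "2",  "price": "1.50"},
    {"product": "apple", "store": "s1", "qty": "2",  "price": "1.50"},  # duplicate
    {"product": "pear",  "store": "s2", "qty": None, "price": "2.00"},  # bad row
    {"product": "pear",  "store": "s1", "qty": "3",  "price": "2.00"},
]

def to_silver(bronze):
    """Silver: drop duplicates and null rows, cast columns to real types."""
    seen, silver = set(), []
    for r in bronze:
        key = tuple(r.values())
        if key in seen or r["qty"] is None:
            continue
        seen.add(key)
        silver.append({**r, "qty": int(r["qty"]), "price": float(r["price"])})
    return silver

def to_gold(silver):
    """Gold: aggregate into analyst-ready revenue totals per product."""
    gold = {}
    for r in silver:
        gold[r["product"]] = gold.get(r["product"], 0.0) + r["qty"] * r["price"]
    return gold
```

Notice that Bronze is never modified: if `to_silver` had a bug, you would fix the code and reprocess from Bronze, which is the whole point of keeping the raw layer untouched.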

What the actual job looks like day to day

A typical week as a junior data engineer at a consulting firm:

  • Monday — A pipeline failed over the weekend. You check the logs, find the source system sent a file with wrong column names. You fix the schema handling and redeploy.
  • Tuesday — The business analyst asks why last month's sales numbers look different from last week's report. You trace through the pipeline and find a timezone issue. You fix it and document the root cause.
  • Wednesday — You are building a new Silver transformation for a new data source (customer returns). You write PySpark code, test it on sample data, review it with your team.
  • Thursday — You configure Azure Data Factory to run the new pipeline on a schedule and set up monitoring alerts for failures.
  • Friday — Code review, documentation, and a conversation with the data analyst about what the Gold layer should look like.

Notice that it is mostly debugging, building, and communicating. Not fancy machine learning. Not complex math. Clear thinking and clean code.

Why the money is good

Every company runs on data now. When a data pipeline breaks, business decisions cannot be made. Reports are wrong. Analysts are blocked. Executives are asking questions nobody can answer.

Data engineers keep the lights on. That is why the pay is high and the job market is strong — the work is critical and there are not enough people who know how to do it well.

For someone coming from India targeting H-1B roles in the US, data engineering is one of the best paths. Consulting firms like Deloitte, Accenture, and Cognizant hire hundreds of data engineers annually and sponsor H-1B visas consistently.

What you need to learn — in order

  1. SQL — you will use this every single day. Window functions, CTEs, aggregations.
  2. Python — for writing pipeline code, PySpark, data validation scripts.
  3. One cloud platform deeply — Azure is the best first choice for H-1B/consulting roles.
  4. Apache Spark / PySpark — the standard distributed processing engine.
  5. Delta Lake or Iceberg — the standard table format for data lakes.
  6. Orchestration — ADF for Azure, Airflow for multi-cloud.
  7. One real project — end-to-end, on a real cloud, with real data.
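To make item 1 concrete, here is the kind of SQL pipelines lean on — a CTE plus a window function — run through Python's built-in sqlite3 (window functions need SQLite 3.25+, which ships with modern Python; the `daily_sales` table is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_sales (store TEXT, day INTEGER, revenue REAL);
INSERT INTO daily_sales VALUES
  ('s1', 1, 100.0), ('s1', 2, 120.0),
  ('s2', 1,  80.0), ('s2', 2,  60.0);
""")

# CTE + window function: each day's revenue alongside the store's running total
query = """
WITH ranked AS (
  SELECT store, day, revenue,
         SUM(revenue) OVER (PARTITION BY store ORDER BY day) AS running_total
  FROM daily_sales
)
SELECT store, day, running_total FROM ranked ORDER BY store, day;
"""
rows = conn.execute(query).fetchall()
```

If `PARTITION BY`, `OVER`, and `WITH` read naturally to you, you are ready for items 2 through 7; if not, that is the first gap to close.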

That is the full list. Everything on Asil is structured around those seven things.

📄 Resume Bullet Points
Copy these directly to your resume — tailored from this lesson

Designed and implemented batch and streaming data pipelines using the Medallion Architecture (Bronze → Silver → Gold)

Built ETL pipelines processing 5M+ records daily across structured and semi-structured data sources

Applied data quality validation frameworks to detect nulls, duplicates, and schema violations before loading to analytics layers
