Data Engineer vs Analyst vs Scientist vs ML Engineer
Clear, lasting boundaries between the four most commonly confused roles in tech.
Why Everyone Confuses These Four Roles
If you ask five people what a data scientist does, you will get five different answers. Ask what separates a data engineer from a data analyst and most people will pause, guess, and get it partly wrong. Even hiring managers conflate these roles — which is why job postings sometimes list responsibilities that belong to three different roles under one title.
The confusion has three roots. First, all four roles work with data — so on the surface they seem similar. Second, small companies cannot afford four specialists, so one person does parts of all four roles, blurring the lines. Third, the field is young enough that the boundaries were genuinely unclear until recently.
But the confusion is expensive. If you are targeting a data engineering role and you do not clearly understand where the role ends and data science begins, you will prepare the wrong skills, apply to the wrong jobs, and be blindsided in interviews. If you are already in a data role, not understanding these boundaries means you cannot have productive conversations about responsibilities with your team.
Each role asks a fundamentally different question. The data engineer asks whether data is moving correctly. The analyst asks what the data reveals. The scientist asks what the data predicts. The ML engineer asks how predictions reach users. These questions require different skills, different tools, and different kinds of thinking. They are not interchangeable.
Data Engineer
The data engineer is the infrastructure builder of the data world. Their job is to make data reliably available to everyone else — analysts, scientists, ML engineers, and the business. Without a data engineer, every other data role spends most of their time doing data engineering work badly instead of doing their actual job well.
What they own
Data engineers own the pipelines that move data and the platforms that store it. They design the architecture of the data lake, build the ingestion connectors, write the transformation logic that cleans and structures data, schedule and monitor every automated job, and maintain the data quality checks that ensure downstream consumers can trust what they receive. When data is wrong, missing, or late, a data engineer investigates and fixes it.
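The ingestion side of this work leans heavily on defensive coding: source APIs fail transiently, so a connector retries with backoff rather than crashing the whole pipeline. A minimal sketch of the pattern in pure Python, with a stand-in for the real API call (the function names and payload are illustrative, not from any specific connector):

```python
import time
import random

def fetch_with_retry(fetch_fn, max_attempts=4, base_delay=1.0):
    """Call fetch_fn, retrying with exponential backoff on failure.

    fetch_fn is any zero-argument callable that raises on transient
    errors (e.g. a wrapped HTTP call to a payments API).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Exponential backoff with jitter: base, 2x, 4x ... plus noise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            time.sleep(delay)

# Hypothetical flaky source: fails twice, then succeeds
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return [{"payment_id": "pay_001", "amount": 499}]

rows = fetch_with_retry(flaky_fetch, base_delay=0.01)
print(rows[0]["payment_id"])  # pay_001, reached after two retries
```

In a real connector the same wrapper would sit around the HTTP request, and a checkpoint (last successfully loaded timestamp or ID) would be persisted so a restart resumes instead of re-ingesting everything.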
TASK PRIMARY SKILL USED
────────────────────────────────────────────────────────────────
Build ingestion pipeline from Python (requests, SQLAlchemy,
Razorpay API to data lake retry logic, checkpointing)
Write transformation that SQL (CTEs, window functions,
cleans and deduplicates orders deduplication patterns)
Debug why yesterday's pipeline Investigation skills + SQL +
produced 15% fewer rows knowledge of data layers
Design the Bronze-Silver-Gold Systems thinking +
table structure for a new source data modelling concepts
Schedule and monitor all Apache Airflow (DAGs,
pipelines with alerting operators, SLAs, XComs)
Optimise a slow Snowflake query SQL query plans + warehouse
from 40 min to 3 min internals (clustering, partitions)
Review a junior DE's code Python best practices + pipeline
for error handling gaps design principles
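The deduplication task in the table above is a classic warehouse pattern: keep the most recent record per key using ROW_NUMBER(). A minimal sketch, run here against an in-memory SQLite database (3.25+ for window functions) as a stand-in for the warehouse — table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id TEXT, status TEXT, loaded_at TEXT);
    INSERT INTO raw_orders VALUES
        ('ord_1', 'created',   '2024-01-01 10:00'),
        ('ord_1', 'delivered', '2024-01-01 12:00'),  -- later duplicate wins
        ('ord_2', 'created',   '2024-01-01 11:00');
""")

# Rank each order's records newest-first, then keep only rank 1
query = """
WITH ranked AS (
    SELECT order_id, status, loaded_at,
           ROW_NUMBER() OVER (
               PARTITION BY order_id
               ORDER BY loaded_at DESC
           ) AS rn
    FROM raw_orders
)
SELECT order_id, status FROM ranked WHERE rn = 1 ORDER BY order_id;
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('ord_1', 'delivered'), ('ord_2', 'created')]
```

The same CTE shape works unchanged in Snowflake, BigQuery, or Redshift; only the load mechanics differ.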
SKILLS PROFILE:
Python: ████████████ Expert
SQL: ████████████ Expert
System design: ████████ Strong
Statistics: ████ Basic
ML concepts: ████ Basic
Visualisation: ██ Minimal

What they do not own
Data engineers do not build dashboards — that is the analyst's job. They do not train machine learning models — that is the data scientist's job. They do not deploy models to production APIs — that is the ML engineer's job. They do not define business metrics — that is a business decision made by stakeholders. A data engineer who is doing all of these things is a data engineer at a company too small to have the right specialists yet.
Data Analyst
The data analyst is the translator between data and business decisions. Their job is to take clean, reliable data that the data engineer has made available and turn it into insights that business teams can understand, trust, and act on. A great analyst makes a business smarter about what is actually happening and why.
What they own
Data analysts own the analysis — the questions asked of data and the answers delivered. They write SQL queries to explore data, build dashboards that track business metrics, create reports for stakeholders, and conduct ad-hoc analysis when a business question arises. They define what the metrics mean, verify that the numbers tell a coherent story, and present findings in a way non-technical stakeholders can act on.
TASK PRIMARY SKILL USED
────────────────────────────────────────────────────────────────
Build a weekly retention SQL (cohort queries, date math,
dashboard for the growth team window functions)
Investigate why conversion SQL exploration + business
dropped 8% last week domain knowledge
Create a Power BI report Power BI / Tableau +
showing revenue by city visualisation best practices
Define the "active user" Business logic + stakeholder
metric for the product team communication
Validate that the new DE SQL + cross-checking numbers
pipeline produces correct numbers against known sources
Answer: "Which acquisition SQL (multi-step analysis) +
channel has best LTV?" Excel / Google Sheets
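Metric definitions like "active user" ultimately become a SQL query everyone agrees on. A minimal sketch using in-memory SQLite as a stand-in for the warehouse — the 7-day window and the table names are illustrative assumptions, not a universal definition:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id TEXT, ordered_at TEXT);
    INSERT INTO orders VALUES
        ('u1', '2024-03-10'),
        ('u1', '2024-03-12'),
        ('u2', '2024-03-01'),  -- outside the 7-day window
        ('u3', '2024-03-13');
""")

# "Active user" = at least one order in the 7 days up to the reference date
query = """
SELECT COUNT(DISTINCT user_id) AS active_users
FROM orders
WHERE ordered_at > date(:as_of, '-7 days')
  AND ordered_at <= :as_of;
"""
(active_users,) = conn.execute(query, {"as_of": "2024-03-14"}).fetchone()
print(active_users)  # 2 -- u1 and u3; u2 last ordered on 1 March
```

The hard part of the analyst's job is not the query — it is getting product, growth, and finance to agree on the window and the qualifying event before the dashboard ships.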
SKILLS PROFILE:
SQL: ████████████ Expert (their primary tool)
Visualisation: ████████████ Expert
Business acumen: ████████ Strong
Python: ████ Basic to intermediate
Statistics: ████████ Intermediate
ML concepts: ██ Minimal
Pipeline code: ██ Minimal

The key difference from a data engineer
Both roles use SQL heavily. The difference is in what they do with it. A data analyst uses SQL to ask questions — to explore and summarise data to find answers. A data engineer uses SQL to build and maintain data models — to create the structures that analysts query. An analyst's SQL query runs once to answer a question. A data engineer's SQL model runs automatically every day in production.
Analysts do not write pipeline code, manage infrastructure, or handle data ingestion. When a data analyst hits a data quality issue, they raise it to the data engineering team. When they need a new data source, they request it from data engineering. The analyst depends on the data engineer having done their job well — and suffers directly when they have not.
Data Scientist
The data scientist uses statistical and machine learning techniques to extract predictions, patterns, and causal understanding from data. Where the analyst looks backward — explaining what happened — the data scientist looks forward: what will likely happen next, and why.
What they own
Data scientists own the modelling work — defining the problem as a machine learning or statistical task, selecting and engineering features, training and evaluating models, and interpreting results for the business. They run experiments (A/B tests, bandit algorithms) to test causal hypotheses. They build the recommendation engines, fraud detection models, demand forecasting systems, and churn prediction models that power data-driven products.
TASK PRIMARY SKILL USED
────────────────────────────────────────────────────────────────
Train a churn prediction model Python (scikit-learn, XGBoost)
for 6-month customer data + statistics + feature engineering
Design an A/B test for a new Statistics (hypothesis testing,
recommendation feature power analysis, p-values)
Analyse whether a new pricing Causal inference + regression
strategy caused revenue lift analysis + business context
Build a demand forecasting Time series analysis (Prophet,
model for Zepto's dark stores ARIMA, or deep learning)
Present model results to Communication + storytelling
the product team + visualisation
Request training data from Collaboration with data engineers
the DE team (feature pipeline) + feature specification
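The A/B-test task above reduces, in its simplest form, to a two-proportion z-test: did the variant's conversion rate differ from control by more than chance would explain? A pure-Python sketch (the conversion counts are made up for illustration):

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (A/B test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF (via the error function)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Control: 500 of 10,000 converted; variant: 580 of 10,000
z, p = two_proportion_z_test(500, 10_000, 580, 10_000)
print(round(z, 2), round(p, 4))  # z is approx. 2.5, p approx. 0.01: significant at 5%
```

In practice the data scientist also runs a power analysis before the test to decide how many users each arm needs — running the significance test alone on an undersized sample is a common mistake.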
SKILLS PROFILE:
Python (ML): ████████████ Expert
Statistics: ████████████ Expert
SQL: ████████ Strong (queries, feature extraction)
ML frameworks: ████████████ Expert (scikit-learn, PyTorch, etc.)
Business acumen: ████████ Strong
Pipeline code: ████ Basic (can write, not their primary job)
Production infra: ████ Basic

The critical dependency on data engineering
Data scientists are the most dependent of all roles on data engineering being done well. A data scientist who does not have reliable, clean feature data cannot train a trustworthy model. A model trained on inconsistent data will behave unpredictably in production. Studies consistently show that data scientists spend 60–80% of their time on data cleaning and preparation at companies with poor data engineering — which means 60–80% of an expensive, specialised skill set is wasted on work that should not be their job.
At a well-engineered company, data scientists get clean feature tables from the data engineer and spend the majority of their time on actual modelling work. This is why good data engineering multiplies the productivity of the entire data organisation.
ML Engineer
The ML engineer sits at the intersection of data science and software engineering. Their job is to take a model that a data scientist trained in a notebook and make it reliably serve predictions to millions of users in a production application — with the same engineering rigour applied to any production software system.
What they own
ML engineers own the production ML systems — model serving infrastructure, real-time feature pipelines, model monitoring, retraining pipelines, and the APIs that serve predictions to applications. When a data scientist says "the model is ready," the ML engineer takes over and makes it production-grade. This involves containerising the model, building the serving API, setting up monitoring to detect model drift, and building the automation that retrains and redeploys the model when its performance degrades.
TASK PRIMARY SKILL USED
────────────────────────────────────────────────────────────────
Wrap a trained model in a Python (FastAPI, Flask) +
REST API with <50ms latency Docker + optimisation
Build real-time feature pipeline Python + Kafka + Redis
that serves features in <10ms (low-latency data access)
Set up model monitoring that MLflow / Evidently + Python
alerts when predictions drift + statistical drift detection
Build automated retraining Python + Airflow/Prefect +
pipeline triggered by metric drop model evaluation logic
Deploy model to Kubernetes Docker + Kubernetes + CI/CD
with autoscaling and rollback + cloud (EKS/AKS/GKE)
Benchmark: can our fraud model Performance profiling +
score 10,000 transactions/second? load testing + optimisation
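Drift monitoring is often implemented with a simple statistic such as the Population Stability Index (PSI), which compares the binned distribution of production model scores against the training-time baseline. A minimal pure-Python sketch — the bin counts are fabricated, and the 0.1 / 0.25 thresholds are a widely used rule of thumb rather than a universal standard:

```python
import math

def psi(expected, actual):
    """Population Stability Index between two binned distributions.

    expected/actual are counts per bin (e.g. model-score deciles at
    training time vs. this week in production). Rule of thumb:
    PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift.
    """
    e_total, a_total = sum(expected), sum(actual)
    total = 0.0
    for e, a in zip(expected, actual):
        # Small floor avoids log(0) when a bin is empty
        e_pct = max(e / e_total, 1e-6)
        a_pct = max(a / a_total, 1e-6)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total

baseline  = [200, 300, 300, 200]  # score distribution at training time
this_week = [195, 305, 295, 205]  # nearly identical -> tiny PSI
shifted   = [450, 300, 175, 75]   # mass moved to low scores -> drift

print(psi(baseline, this_week) < 0.1)   # True: no alert
print(psi(baseline, shifted) > 0.25)    # True: fire a drift alert
```

Tools like Evidently compute this (and richer statistics) out of the box; the value of knowing the raw calculation is being able to reason about why an alert fired.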
SKILLS PROFILE:
Python: ████████████ Expert
Software eng: ████████████ Expert (APIs, testing, CI/CD)
ML frameworks: ████████████ Expert
Statistics: ████████ Strong
Infrastructure: ████████ Strong (Docker, K8s, cloud)
SQL: ████████ Strong
Pipeline code: ████████ Strong (real-time focus)

How ML engineer differs from data scientist
Data scientists optimise for model accuracy — they care about whether the model predicts correctly. ML engineers optimise for model reliability — they care about whether the model serves predictions correctly, consistently, and at the required speed, without failing, even when traffic is 10× normal. A data scientist's primary artefact is a trained model. An ML engineer's primary artefact is a system that serves model predictions reliably in production.
How ML engineer differs from data engineer
Both roles build data pipelines, but for different purposes. A data engineer builds batch pipelines that process large historical datasets for analysis. An ML engineer builds real-time feature pipelines that serve pre-computed features with millisecond latency for live model inference. A data engineer's pipeline can tolerate one-hour latency. An ML engineer's feature pipeline must respond in under 10 milliseconds.
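The difference is easiest to see in code: the batch half precomputes per-user features on a schedule, and the online half is nothing more than a key lookup. A toy sketch with an in-memory dict standing in for Redis or a feature store (all names are illustrative):

```python
# Batch side (data-engineering style): recompute features on a schedule
def build_feature_table(orders):
    """Aggregate each user's order history into model features (batch job)."""
    features = {}
    for user_id, amount in orders:
        f = features.setdefault(user_id, {"order_count": 0, "total_spend": 0})
        f["order_count"] += 1
        f["total_spend"] += amount
    return features

# Online side (ML-engineering style): a single key lookup at request time.
# In production this dict would be Redis or a feature store, not memory.
feature_store = build_feature_table([("u1", 250), ("u1", 120), ("u2", 90)])

def get_features(user_id):
    # O(1) lookup keeps the serving path well inside the latency budget;
    # unknown users fall back to neutral default features
    return feature_store.get(user_id, {"order_count": 0, "total_spend": 0})

print(get_features("u1"))  # {'order_count': 2, 'total_spend': 370}
```

The design point is that no aggregation happens at inference time — all the expensive work was done in batch, which is what makes a sub-10ms response achievable.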
All Four Roles — Side by Side
The fastest way to permanently internalise the boundaries is to see the same dimension compared across all four roles at once.
DIMENSION DATA ENGINEER DATA ANALYST DATA SCIENTIST ML ENGINEER
─────────────────────────────────────────────────────────────────────────────────
Core question "Is data moving "What happened "What will "How do predictions
reliably?" and why?" happen next?" reach users at scale?"
Primary output Pipelines + Dashboards + Trained models + Production ML
data tables reports experiment results systems + APIs
Primary skill Python + SQL + Python + Python +
system design visualisation statistics software engineering
SQL usage Build models Query for Extract features Query for
that run daily ad-hoc answers for training monitoring
Python usage Pipeline code Basic scripts Model training APIs + infra
Cares about Pipeline Business logic Statistical Model latency,
data quality for correctness correctness quality of data throughput, drift
Works with data Moves it, Queries it, Trains on it, Serves predictions
structures it analyses it learns from it from it
Infra ownership Data platform None Notebooks/ Full production
(lake, warehouse) experiments ML infra
Depends on Source systems Data engineers Data engineers Data scientists
being accessible making data for clean data for trained models
reliable + clean
Blocked when Source schema DE pipeline Data is dirty, Model is not
changes fails or data unstructured, production-ready
is stale or unavailable or drifting
Typical background CS, SWE, or Business, econ, Statistics, SWE or DS with
analytics           analytics, CS     maths, CS          strong eng skills

The dependency chain — visualised
These roles are not independent. They form a chain where each role enables the next. This chain is why data engineering being done well has a multiplier effect on the entire organisation, and why data engineering being done poorly makes everyone downstream less effective.
Raw data in source systems
│
│ Ingestion, transformation, quality, reliability
▼
┌─────────────────────────────────────────────┐
│ DATA ENGINEER │
│ Builds and maintains the data platform │
│ Ensures clean, timely, trustworthy data │
└──────────────┬──────────────────────────────┘
│ Reliable, structured data available
┌─────────┴──────────────┐
│ │
▼ ▼
DATA ANALYST DATA SCIENTIST
Queries Gold tables Gets Silver/feature
for business insights tables for model training
Builds dashboards Trains models, runs experiments
Reports findings Produces predictions
│ │
▼ ▼
Business decisions ML ENGINEER
Strategy, resource Takes trained model
allocation, product Makes it production-grade
planning Serves predictions in real-time
Monitors for drift
│
▼
Production ML features
(recommendations, fraud scores,
demand forecasts, rankings)
Impact of DE failure:
If DE pipelines fail:
→ Analyst has no data to analyse → dashboard goes stale
→ Scientist has no training data → models cannot be retrained
→ ML Engineer has no features → model predictions degrade silently

Where the Roles Overlap — And What "Unicorn" Really Means
In a perfect world with unlimited budget, every company would have dedicated specialists in each role. In the real world — especially at early-stage startups — one or two people cover all four roles. This creates the "unicorn data scientist" myth: one person who can ingest, clean, model, deploy, and visualise everything.
What actually happens at different company stages
The Analytics Engineer — the emerging fifth role
A relatively new role that sits at the boundary of data engineering and data analysis is the Analytics Engineer. Analytics engineers own the transformation layer — they write dbt models that turn raw data into clean, analysis-ready tables. They have stronger SQL skills than a typical data engineer but stronger data modelling instincts than a typical analyst.
Analytics engineers are hired by companies where the transformation work is large enough to need a dedicated owner, but the work is SQL-based rather than Python pipeline-based. They are the main users of dbt in most organisations. If you enjoy SQL data modelling more than pipeline infrastructure, this is an increasingly viable career path with strong demand in 2026.
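A dbt model is just a SELECT statement in a file; dbt materialises it as a table or view and resolves dependencies between models through ref(). A hypothetical model of the kind an analytics engineer would own — every name here is illustrative:

```sql
-- models/marts/fct_daily_orders.sql  (hypothetical dbt model)
-- Turns a staging table into an analysis-ready daily fact table.
SELECT
    order_date,
    city,
    COUNT(*)          AS order_count,
    SUM(order_amount) AS gross_revenue
FROM {{ ref('stg_orders') }}  -- dbt resolves this to the staging model
WHERE order_status != 'cancelled'
GROUP BY order_date, city
```

Because ref() declares the dependency, dbt can build models in the right order, test them, and document the lineage — which is exactly the transformation-layer ownership described above.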
Which Role Should You Target? An Honest Decision Framework
This is the most practically important part of this module for someone at the beginning of their career. The answer is not "whichever pays the most" or "whichever sounds most impressive." The answer is "whichever matches how your brain actually works and what kind of problems you genuinely enjoy."
The fastest way to get clarity is to answer these four questions honestly:
The honest advice for non-IT background candidates
If you are coming from a non-IT background, the most accessible path in terms of time to first job and availability of entry-level roles is typically: Data Analyst → Data Engineer in that order, or Data Engineer directly if you are comfortable with programming concepts.
Data analysis is accessible with strong SQL skills alone — you do not need Python to get a first analyst job. Data science requires a strong statistics background that takes longer to build. ML engineering requires solid software engineering foundations on top of ML knowledge. Data engineering is in between — you need Python and SQL, but you do not need advanced mathematics.
Role Min time to Key bottleneck
first job
Data Analyst 4–6 months SQL proficiency + BI tool (Power BI/Tableau)
+ portfolio of analysis projects
Data Engineer 6–9 months Python + SQL + one cloud cert (DP-203 or
AWS SAA) + 3 pipeline projects on GitHub
Data Scientist 12–18 months Statistics + Python (ML libs) + maths background
The statistics foundation takes longest to build
ML Engineer 18–24 months Requires strong software engineering first,
then ML + production infra on top
These timelines assume 15–20 hours/week of focused study.
Consistent daily practice beats intensive weekend sprints.

One Business Problem — Four Roles, Four Completely Different Jobs
The product team wants to build a "Recommended for You" section on the home screen. Each user sees restaurants personalised to their taste history. Here is what each role does to make this happen.
Every role was essential. The ML engineer cannot serve what the scientist did not train. The scientist cannot train what the data engineer did not prepare. The analyst cannot measure what the product team did not define. And none of it reaches users without the ML engineer's production system. This is the dependency chain made real.
5 Interview Questions — With Complete Answers
Errors You Will Hit — And Exactly Why They Happen
🎯 Key Takeaways
- ✓ Each role asks a different core question. Data Engineer: "Is data moving reliably?" Analyst: "What happened?" Scientist: "What will happen?" ML Engineer: "How do predictions reach users at scale?" These questions require different skills and are not interchangeable.
- ✓ Data engineers own the pipeline and the platform — they build the infrastructure everyone else depends on. They do not build dashboards, train models, or deploy serving APIs.
- ✓ Data analysts own the analysis — SQL queries, dashboards, reports, and metric definitions. Their primary tool is SQL, not Python. They consume what data engineers produce.
- ✓ Data scientists own the modelling — training ML models, running experiments, and interpreting statistical results. They depend on data engineers for clean feature data. Without reliable data engineering, 60–80% of their time goes to cleaning data instead of modelling.
- ✓ ML engineers own production ML systems — model serving, real-time feature pipelines, monitoring, and automated retraining. They bridge data science and software engineering.
- ✓ The four roles form a dependency chain: DE enables DA and DS. DS enables MLE. A failure in DE propagates to all downstream roles. This is why good data engineering multiplies the productivity of the entire data organisation.
- ✓ At small companies one person covers multiple roles. At large companies roles are fully specialised. "Data Scientist" at a Series A startup often means "person who does all data work including engineering" — evaluate the actual responsibilities, not the title.
- ✓ The Analytics Engineer is an emerging fifth role that owns the transformation layer (dbt models). They sit between engineering and analysis with stronger SQL data modelling skills than either traditional role.
- ✓ Job postings listing ETL pipeline maintenance, Airflow DAGs, and warehouse management under a "Data Scientist" title are mislabelled data engineering roles. Clarify responsibilities before accepting to avoid career positioning mistakes.
- ✓ For non-IT background candidates, Data Analyst (4–6 months to first job with SQL) is the most accessible entry point. Data Engineer (6–9 months) is next. Data Science requires longer because of the statistics foundation needed.