How Data Moves Through a Company
The complete end-to-end journey — from data creation to business decisions.
Follow One Data Point From Birth to Dashboard
The best way to understand how data moves through a company is to follow a single data point from the moment it is created to the moment it influences a decision. Not in theory — in the specific, concrete reality of what actually happens at the system level.
The data point we will follow: a customer places an order on Zomato at 8:14 PM on a Tuesday evening. They order one Butter Chicken and two Garlic Naan from a restaurant in Koramangala, Bangalore. The order total is ₹380.
By the time the Zomato data team arrives at work the next morning, that single order has touched at least eight different systems, been stored in at least four different formats, been transformed at least three times, and contributed to at least a dozen different metrics. Here is exactly how.
8:14:32 PM → Customer taps "Place Order" on the Zomato app
8:14:32 PM → Mobile app sends HTTP POST request to Zomato's
Order Service API with order details as JSON
8:14:33 PM → Order Service validates the request, assigns
order ID #ZMT-9284751, writes a row to the
Orders table in the production PostgreSQL database
8:14:33 PM → PostgreSQL writes the row to its Write-Ahead Log (WAL)
— the durable record of every committed change
8:14:34 PM → A Change Data Capture (CDC) agent reads the WAL,
detects the new order row, and publishes an event
to a Kafka topic called "order.created"
8:14:34 PM → Multiple services consume from "order.created":
- Notification service → sends SMS to customer
- Restaurant service → sends order to restaurant tablet
- Analytics service → forwards event to data platform
8:14:35 PM → Analytics service writes the raw event as JSON
to Azure Data Lake Storage (the landing zone)
Path: /raw/orders/2026/03/17/20/ZMT-9284751.json
11:00 PM → Scheduled batch pipeline runs:
Reads all new JSON files from the landing zone
Validates, deduplicates, converts to Parquet format
Writes to: /bronze/orders/date=2026-03-17/
2:00 AM → dbt transformation runs:
Reads bronze Parquet files
Cleans, standardises, joins with restaurant and
customer reference data
Writes to silver layer: orders_cleaned table
4:00 AM → Aggregation pipeline runs:
Reads silver orders_cleaned
Computes daily metrics: GMV by city, by cuisine,
by hour, repeat customer rate, average order value
Writes to gold layer: daily_order_metrics table
6:00 AM → Power BI dashboard refreshes
Reads from gold layer
Dashboard now shows last night's order data
including our ₹380 Koramangala order
9:00 AM → Zomato growth manager opens the dashboard
Sees: Bangalore GMV last night: ₹4.2 crore
Our order contributed ₹380 to that number

That is the journey. One tap on a phone, eight systems, fourteen hours, one number on a dashboard. Every step in that journey is something a data engineer designed, built, or maintains. Now we are going to understand each step properly.
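To make the 8:14:34 PM hand-off concrete, here is a minimal sketch of the "order.created" event as it might be serialised for Kafka. The field names and schema are illustrative assumptions, not Zomato's actual event format.

```python
import json

# Hypothetical shape of the "order.created" event the CDC agent publishes.
# Field names are illustrative only -- real schemas are owned by the team
# that operates the Order Service.
event = {
    "event_type": "order.created",
    "order_id": "ZMT-9284751",
    "restaurant_area": "Koramangala",
    "city": "Bangalore",
    "items": [
        {"name": "Butter Chicken", "qty": 1},
        {"name": "Garlic Naan", "qty": 2},
    ],
    "order_total_inr": 380,
    "created_at": "2026-03-17T20:14:33+05:30",
}

# Kafka producers send bytes; JSON is one common encoding (Avro/Protobuf
# are also widely used).
payload = json.dumps(event).encode("utf-8")
```

All three consumers (notification, restaurant, analytics) would deserialise this same payload; the analytics copy is what later lands in the data lake as a raw JSON file.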
Source Systems — Where Every Data Journey Begins
A source system is any system that creates data as a byproduct of doing its primary job. Source systems are not built to make data available for analysis — they are built to run the business. Data availability for analytics is a side effect that data engineers must deliberately capture.
Understanding source systems is the first thing a data engineer does when joining a company or starting a new project. You cannot build a pipeline to move data without knowing exactly what you are moving, where it lives, how it is structured, and how it changes over time.
Ingestion — The First Engineering Problem
Ingestion is the act of reading data from a source system and writing it into the data platform. It sounds simple. In practice, it is where most data engineering complexity lives, because source systems were not designed to be read from by analytics pipelines. They were designed to serve applications. Ingestion is the process of extracting data without disrupting that primary purpose.
The two fundamental approaches to ingestion
Every ingestion pipeline is either batch or streaming — and the choice is determined entirely by how fresh the data needs to be downstream.
Full load vs incremental load
Beyond batch vs streaming, there is a second axis: do you pull all the data every time (full load) or only the data that is new or changed since the last run (incremental load)?
Full load is simple — truncate the destination table, reload everything from the source. It works well when the source table is small (under a few million rows) and the cost of re-reading everything is acceptable. The advantage is simplicity: there is no logic to track what has changed, and the destination always exactly mirrors the source.
Incremental load is necessary when the source table is large or when network/compute costs of full loads are too high. You only pull records that were created or modified after the last run. This requires a reliable way to identify new and changed records — typically a timestamp column (updated_at), an auto-incrementing ID, or a change data capture log.
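A minimal sketch of that incremental pattern in Python, assuming the source table has an updated_at column and the pipeline stores a checkpoint between runs (all names here are illustrative):

```python
from datetime import datetime, timezone

def build_incremental_query(table: str, last_run_time: datetime) -> str:
    """Build the extraction query for an incremental load: only rows
    modified after the previous run's checkpoint are pulled."""
    # Production code would use a parameterised query rather than string
    # formatting, and would persist the new checkpoint only after the
    # load has succeeded.
    return (
        f"SELECT * FROM {table} "
        f"WHERE updated_at > '{last_run_time.isoformat()}' "
        f"ORDER BY updated_at"
    )

checkpoint = datetime(2026, 3, 17, 2, 0, tzinfo=timezone.utc)
query = build_incremental_query("orders", checkpoint)
print(query)
```

Note the failure mode this design must handle: if the pipeline crashes after extracting but before saving the new checkpoint, the next run re-reads the same window, so downstream processing must tolerate duplicates.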
Question 1: How fresh does the data need to be?
Within seconds → Streaming ingestion (Kafka, Event Hubs)
Within hours → Batch ingestion (scheduled pipeline)
Within days → Batch ingestion (daily or less frequent)
Question 2: How large is the source?
Small (< 1M rows, stable) → Full load each run — keep it simple
Large (> 1M rows) → Incremental load — only pull changes
Question 3: How do you identify what has changed?
updated_at timestamp exists → Filter WHERE updated_at > last_run_time
Auto-increment ID exists → Filter WHERE id > last_max_id
Neither exists → Use CDC (read the database transaction log)
File-based source → Track which files have been processed
Real example — Swiggy orders ingestion decision:
Freshness needed: hourly for ops dashboards
Source size: 3M+ rows per day, growing
Change detection: order_time timestamp + order_id auto-increment
Decision: Incremental batch, runs every hour,
filters WHERE order_time > last_checkpoint

The Landing Zone — Where Raw Data First Arrives
The landing zone is the first destination inside the data platform. Data arrives here exactly as it came from the source, with no transformation, no cleaning, no schema enforcement. Its only job is to accept everything that arrives and hold it until the next processing stage picks it up.
Think of it like the receiving dock of a warehouse. Every delivery arrives here first, regardless of what it is or what condition it is in. Nothing is unpacked or inspected at the dock — it is just logged and stored. Processing happens later, in a different area.
Why the landing zone exists
The landing zone serves a critical purpose that beginners often do not appreciate until they have worked without one: it decouples ingestion from processing.
Without a landing zone, the ingestion pipeline writes directly to the processing layer. If the processing logic has a bug, the pipeline must be stopped, the bug fixed, and all the data re-ingested from the source. For sources that do not keep history (like some APIs and streaming systems), that data may be gone forever.
With a landing zone, the raw data is always preserved. If the processing logic has a bug, you fix it and reprocess from the landing zone — the source does not need to be touched. This pattern of preserving raw data permanently is one of the most important principles in data engineering.
Landing zone: Azure Data Lake Storage (ADLS Gen2)
Container: raw-landing
/orders/
2026/03/17/08/ ← hour-partitioned
batch_001.json
batch_002.json
2026/03/17/09/
batch_001.json
/payments/
2026/03/17/
razorpay_export_20260317.csv
/user_events/
2026/03/17/09/15/ ← minute-partitioned (high volume)
events_1710664500.parquet
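The partitioning convention shown above can be generated deterministically at ingestion time. A sketch (the path style follows the layout above; the function name is made up):

```python
from datetime import datetime, timezone

def landing_path(source: str, ts: datetime, filename: str,
                 minute_partition: bool = False) -> str:
    """Build an hour-partitioned (or minute-partitioned) landing-zone path.

    High-volume sources like user_events get minute-level partitions so
    downstream readers can pick up small, recent slices efficiently.
    """
    parts = [source, ts.strftime("%Y"), ts.strftime("%m"),
             ts.strftime("%d"), ts.strftime("%H")]
    if minute_partition:
        parts.append(ts.strftime("%M"))
    return "/" + "/".join(parts) + "/" + filename

ts = datetime(2026, 3, 17, 8, 0, tzinfo=timezone.utc)
print(landing_path("orders", ts, "batch_001.json"))
# → /orders/2026/03/17/08/batch_001.json
```

Deterministic paths matter: if an ingestion run is retried, it overwrites the same location instead of creating duplicates.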
Key rules for landing zones:
✓ Never delete raw files (storage is cheap; source access is not guaranteed)
✓ Partition by date and time for efficient downstream reads
✓ Store in original format — do not transform at this stage
✓ Add metadata: source name, ingestion timestamp, pipeline run ID
✓ Retain for at least 30–90 days (often much longer for compliance)

Bronze, Silver, Gold — The Medallion Architecture
Once data has landed in the platform, it moves through a series of processing layers. The most widely adopted pattern for organising these layers is called the Medallion Architecture — named for its three tiers: Bronze, Silver, and Gold. Understanding this pattern is essential because almost every modern data platform uses some version of it.
The key insight behind the Medallion Architecture is this: each layer has a clearly defined purpose and a clearly defined quality level. Data gets cleaner, more structured, and more business-specific as it moves from Bronze to Gold. You always know exactly what to expect in each layer.
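As a toy illustration of what "cleaner as it moves down the layers" means, here is the Bronze-to-Silver deduplication step in plain Python. Real pipelines do this in dbt or Spark, but the logic is the same: keep exactly one record per entity, preferring the latest version. Field names are illustrative.

```python
def deduplicate(records):
    """Keep the latest version of each order (one row per order_id).

    Bronze may contain the same order twice -- e.g. the ingestion job
    retried, or the order row was updated at the source. Silver keeps
    exactly one copy: the most recently ingested.
    """
    latest = {}
    for rec in records:
        key = rec["order_id"]
        if key not in latest or rec["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = rec
    return list(latest.values())

bronze = [
    {"order_id": "ZMT-1", "status": "placed",    "ingested_at": "2026-03-17T20:15"},
    {"order_id": "ZMT-1", "status": "delivered", "ingested_at": "2026-03-17T21:02"},
    {"order_id": "ZMT-2", "status": "placed",    "ingested_at": "2026-03-17T20:40"},
]
silver = deduplicate(bronze)
print(len(silver))  # → 2
```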
SOURCE SYSTEMS
PostgreSQL (orders) ──┐
Kafka (user events) ──┤
Razorpay API ──┤──→ LANDING ZONE (raw files, any format)
CSV from vendors ──┤ │
Google Analytics API ──┘ │
▼
┌─────── BRONZE ────────┐
│ Raw Parquet files │
│ Partitioned by date │
│ All data preserved │
│ Ingestion metadata │
└──────────┬────────────┘
│ clean + conform
▼
┌─────── SILVER ────────┐
│ Deduplicated │
│ Validated types │
│ Standardised schema │
│ Business rules applied│
└──────────┬────────────┘
│ aggregate + shape
▼
┌─────── GOLD ──────────┐
│ Business metrics │
│ Pre-aggregated │
│ Denormalised │
│ Dashboard-ready │
└──────────┬────────────┘
│
┌─────────────────────┼──────────────────────┐
▼ ▼ ▼
Power BI / Data Science ML Feature
Tableau Notebooks Store
Dashboards Model Training Real-time
Serving

The Data Warehouse — Where Data Becomes Queryable
A data warehouse is a database designed specifically for analytical queries — fast aggregations and scans over large amounts of data using SQL. Gold layer tables typically live in a data warehouse, because the warehouse is what business analysts and dashboards connect to.
The fundamental design difference between a data warehouse and a regular operational database is storage orientation. This single difference explains why warehouses are so much faster for analytics.
Row storage vs columnar storage
In a regular row-oriented database, all the values for one row are stored together on disk. Reading a row is fast. But analytical queries typically need to scan one or two columns across millions of rows — like summing the order_amount column across 100 million orders. In a row store, reading that column requires reading every column of every row — most of the data read is immediately discarded.
ROW-ORIENTED STORAGE (PostgreSQL, MySQL):
Disk block 1: [order_id=1, customer="Priya", city="Mumbai", amount=380, status="delivered"]
Disk block 2: [order_id=2, customer="Rahul", city="Bangalore", amount=220, status="cancelled"]
Disk block 3: [order_id=3, customer="Aisha", city="Hyderabad", amount=540, status="delivered"]
Query: SELECT SUM(amount) FROM orders
Must read: ALL columns for ALL rows → massive I/O for one column
COLUMNAR STORAGE (Snowflake, BigQuery, Redshift):
Column "order_id": [1, 2, 3, 4, 5, 6, 7, 8 ...]
Column "customer": ["Priya", "Rahul", "Aisha" ...]
Column "city": ["Mumbai", "Bangalore", "Hyderabad" ...]
Column "amount": [380, 220, 540, 180, 760, 320, 450, 290 ...]
Column "status": ["delivered", "cancelled", "delivered" ...]
Query: SELECT SUM(amount) FROM orders
Reads: ONLY the "amount" column → 10–100× less I/O
Bonus: similar values in a column compress extremely well
[380, 380, 380, 380] → stored as "380 × 4"
→ further reduces storage and read time

This is why a query that takes 45 minutes on a PostgreSQL table of 100 million rows can run in 4 seconds on Snowflake or BigQuery. The data warehouse reads only the columns the query needs. Less data read from disk means dramatically faster queries.
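The difference can be felt even in a toy in-memory model: a row store must touch every field of every row to sum one column, while a column store reads only the one list it needs. A pure-Python sketch, no warehouse required:

```python
# Row-oriented: one dict per row -- summing "amount" walks every row object
# and drags all the other fields through memory with it.
rows = [
    {"order_id": 1, "customer": "Priya", "city": "Mumbai",    "amount": 380, "status": "delivered"},
    {"order_id": 2, "customer": "Rahul", "city": "Bangalore", "amount": 220, "status": "cancelled"},
    {"order_id": 3, "customer": "Aisha", "city": "Hyderabad", "amount": 540, "status": "delivered"},
]
row_sum = sum(r["amount"] for r in rows)

# Column-oriented: one list per column -- summing "amount" reads one list
# and never touches customer, city, or status.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["Priya", "Rahul", "Aisha"],
    "city":     ["Mumbai", "Bangalore", "Hyderabad"],
    "amount":   [380, 220, 540],
    "status":   ["delivered", "cancelled", "delivered"],
}
col_sum = sum(columns["amount"])

print(row_sum, col_sum)  # → 1140 1140
```

On disk, the column layout also compresses far better, because runs of similar values sit next to each other (the "380 × 4" run-length example above).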
The Serving Layer — Getting Data to the People Who Need It
The serving layer is everything between the warehouse and the final consumer of the data. Even perfectly engineered Gold tables are useless if the people who need the data cannot access it easily, quickly, and in the right format.
Batch vs Real-Time — Two Completely Different Paths
Everything described so far — landing zone, Bronze, Silver, Gold — describes the batch data path. Data moves in scheduled waves, typically every hour or every day. But some business decisions cannot wait hours. For those cases, there is a parallel path: the real-time streaming pipeline.
Both paths often exist simultaneously at the same company, serving different consumers with different freshness requirements.
SOURCE: Zomato order events
BATCH PATH (runs every hour):
Kafka → S3 landing zone → Bronze (Parquet) → Silver (cleaned) → Gold (aggregated)
Latency: 1–3 hours
Consumers: Business dashboards, weekly reports, model training
Cost: Low (runs once per hour, then stops)
Complexity: Low (simple ETL, easy to debug)
STREAMING PATH (always running):
Kafka → Stream processor (Flink/Spark Streaming) → Real-time store (Redis/DynamoDB)
Latency: 1–10 seconds
Consumers: Live order tracking, fraud detection, real-time driver dispatch
Cost: High (always running, dedicated infrastructure)
Complexity: High (stateful processing, exactly-once guarantees, backpressure)
DECISION RULE:
"Does a bad decision made on 1-hour-old data cost more
than the cost and complexity of a streaming pipeline?"
If YES → streaming
If NO → batch
Most companies: 80% batch, 20% streaming
Most companies need streaming for: fraud detection, live inventory,
real-time recommendations, operational dashboards

The Complete Data Platform — All Layers in One View
Now that you understand each individual layer, here is the complete picture — every layer of a modern data platform, from source to consumer, in one view. This is the architecture you will encounter at most well-engineered Indian tech companies in 2026.
┌─────────────────────────────────────────────────────────────────┐
│ SOURCE SYSTEMS │
│ PostgreSQL │ Kafka streams │ REST APIs │ Files │ Logs │
└──────────────────────────┬──────────────────────────────────────┘
│ ingestion (batch + streaming)
▼
┌─────────────────────────────────────────────────────────────────┐
│ LANDING ZONE │
│ Raw files, original format, never deleted │
│ Azure ADLS / Amazon S3 / Google Cloud Storage │
└──────────────────────────┬──────────────────────────────────────┘
│ format standardisation + partitioning
▼
┌─────────────────────────────────────────────────────────────────┐
│ BRONZE LAYER │
│ Parquet / Delta Lake · Partitioned · Metadata added │
│ All data preserved · No business logic applied │
└──────────────────────────┬──────────────────────────────────────┘
│ cleaning + conforming (dbt / Spark)
▼
┌─────────────────────────────────────────────────────────────────┐
│ SILVER LAYER │
│ Deduplicated · Validated · Standardised · Business rules │
│ One row per entity · Joined with reference data │
└──────────────────────────┬──────────────────────────────────────┘
│ aggregation + business shaping
▼
┌─────────────────────────────────────────────────────────────────┐
│ GOLD LAYER │
│ Business metrics · Pre-aggregated · Denormalised │
│ Named in business terms · SLA-bound refresh │
└──────────────────────────┬──────────────────────────────────────┘
│
┌────────────────┼────────────────┬────────────────┐
▼ ▼ ▼ ▼
BI Dashboards Data Science ML Feature Reverse ETL
Power BI Notebooks Store → CRM
Tableau Model Training Real-time → App DB
Metabase Experiments Predictions → Marketing
CROSS-CUTTING:
Orchestration (Airflow) → schedules and monitors every step
Data Quality (dbt tests) → validates each layer's output
Observability (alerts) → notifies when pipelines fail or data degrades
Governance (catalogue) → documents what exists and who can access it

Every company's data platform is a variation of this pattern. The specific tools differ — one company uses Airflow, another uses Prefect; one uses Snowflake, another uses BigQuery. But the layers and their purposes are consistent across the industry. When you join a new team, you will immediately know where to look for each type of data and who is responsible for each layer.
Tracing a Data Quality Bug Through All the Layers
It is 9:30 AM. The Head of Growth opens the weekly GMV dashboard and immediately sends a Slack message to the data team: "Weekend GMV looks 18% lower than last weekend. Is this real or a data issue?" You are the on-call data engineer.
Step 1 — Check the Gold layer
You query the Gold daily_order_metrics table. Saturday shows ₹3.8 crore. Last Saturday showed ₹4.6 crore. The drop is there. But is it real business data or a pipeline problem?
Step 2 — Check the Silver layer
You count rows in the Silver orders_cleaned table for Saturday. You find 420,000 orders. That is significantly lower than the typical 520,000 on a Saturday. The missing orders are not in Silver. The problem happened before Silver.
Step 3 — Check the Bronze layer
You count rows in the Bronze orders_raw table for Saturday. 520,000 rows. The orders are in Bronze. The problem happened between Bronze and Silver.
Step 4 — Find the transformation bug
You look at the dbt model that transforms Bronze to Silver. You find a filter: WHERE order_status != 'test'. You check the Bronze data for Saturday — 100,000 orders have order_status = 'test'. Last week it was 2,000. The engineering team deployed a change on Friday that accidentally tagged all restaurant-side orders as 'test'. 100,000 real orders were silently excluded from Silver and never made it to Gold.
The resolution
You raise an incident, notify the Head of Growth that it is a data issue (not a business issue), coordinate with the engineering team to fix the status tagging bug in production, then backfill the Silver and Gold layers for Saturday and Sunday once the source data is corrected. Total resolution time: 3 hours.
This is why understanding the layers matters. Without knowing the architecture, you cannot systematically trace where data was lost. With it, you bisect the problem in four queries instead of spending hours guessing.
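The bisection itself can be expressed as a tiny helper: given row counts per layer (the output of the queries in steps 2 and 3), report the first boundary where rows went missing. Note that Gold is aggregated, so there you compare metric values rather than raw row counts. Names below follow the walkthrough above.

```python
def first_lossy_boundary(counts):
    """Given ordered (layer, row_count) pairs from upstream to downstream,
    return the first layer boundary where rows were lost."""
    for (up_layer, up_n), (down_layer, down_n) in zip(counts, counts[1:]):
        if down_n < up_n:
            return f"{up_layer} -> {down_layer}"
    return None  # counts are consistent: the drop is real business data

# Saturday's counts from the incident walkthrough
saturday = [
    ("bronze.orders_raw", 520_000),
    ("silver.orders_cleaned", 420_000),
]
print(first_lossy_boundary(saturday))  # → bronze.orders_raw -> silver.orders_cleaned
```

The same comparison is exactly what automated data-quality checks (e.g. dbt tests on row counts between layers) do on every run, so a bug like this pages the data team before the Head of Growth notices.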
🎯 Key Takeaways
- ✓ Every piece of data follows the same journey: Source System → Landing Zone → Bronze → Silver → Gold → Consumer. Understanding this path means you can always find where data was lost or changed.
- ✓ Source systems are not built to serve analytics. They are built to run the business. The data engineer's job is to capture data from them without disrupting their primary purpose.
- ✓ The landing zone preserves raw data exactly as it arrived. Never delete it. It is what lets you reprocess everything from scratch when transformation logic changes.
- ✓ Bronze preserves everything in a queryable columnar format. Silver cleans and conforms to one trusted record per entity. Gold aggregates and shapes data for specific business questions.
- ✓ Columnar storage (Parquet, Delta Lake) is the foundation of fast analytical queries. It allows reading only the columns needed and enables aggressive compression, making queries 10–100× faster than row-oriented stores.
- ✓ Batch ingestion is simple, cheap, and correct for most use cases. Streaming is expensive and complex. Choose streaming only when you can identify a specific business decision that requires data fresher than one hour.
- ✓ The Medallion Architecture (Bronze–Silver–Gold) is the industry-standard pattern. Learn it once and you understand the data platform structure at almost every modern company.
- ✓ Data discrepancies should always be investigated by layer bisection — start at the dashboard, work backwards to the source, and identify exactly where the numbers diverge. Four queries, not four hours.
- ✓ A single data point — one Zomato order — can touch eight systems in 14 hours before it appears in a dashboard metric. Every system in that chain is something a data engineer owns.
- ✓ The serving layer is not just "put data in the warehouse." It means different things for different consumers: pre-aggregated Gold for analysts, Silver features for scientists, feature stores for ML engineers, reverse ETL for operational systems.