Data Engineering — Module 07Beginner+100 XP

Structured, Semi-Structured and Unstructured Data

The three categories every data engineer works with — what each demands from your pipeline.

45 min March 2026

// Part 01 — Why These Categories Matter

The Three Categories — Why Every Data Engineer Must Know Them

Walk into any data engineering interview and say "I can handle all types of data" — you will be asked to prove it immediately. Because the three categories of data require fundamentally different storage systems, different parsing approaches, different transformation techniques, and different infrastructure choices. Confusing them in a pipeline causes silent bugs. Choosing the wrong storage for a category creates performance problems that cannot be fixed without rebuilding.

Every data source you will ever encounter belongs to one of three categories. The moment you identify which category a new data source belongs to, you know which tools to reach for, which ingestion approach to use, and which complications to expect. That pattern recognition is what this module builds.

Structured

Data with a rigid, predefined schema — rows and columns where every value has a known type and position.

e.g. PostgreSQL tables, CSV files, Excel sheets

~20% of all enterprise data

Semi-Structured

Data with some organisation but no rigid schema — self-describing format where structure can vary between records.

e.g. JSON, XML, Avro, Parquet, log files

~40% of all enterprise data

Unstructured

Data with no predefined schema and no inherent organisation — raw content that requires interpretation to extract meaning.

e.g. Images, audio, video, free-text, PDFs, emails

~80% of all data generated globally

💡 Note

The percentage paradox: Structured data is only 20% of enterprise data but accounts for 80% of what gets analysed, because it is the easiest to query. Unstructured data is 80% of all data generated but most organisations cannot use it without specialised ML models. As a data engineer, you will spend the most time on structured and semi-structured data — but understanding unstructured data is increasingly critical as AI workloads grow.

// Part 02 — Category One

📋

Structured Data

Structured data has a fixed, predefined schema. Every record has exactly the same fields, every field has a known data type, and every value occupies a known position. This rigidity is what makes structured data easy to query, easy to validate, and easy to analyse — but also what makes it break when reality does not fit neatly into rows and columns.

What makes it "structured"

Structure means: before you read a single row of data, you know exactly what you will find. A relational database table defining orders has columns named order_id, customer_id, order_amount, and created_at. Every row has exactly these four fields. order_id is always an integer. order_amount is always a decimal. created_at is always a timestamp. The schema is the contract, and the database enforces it.

Structured data — a relational table with strict schema

Table: orders (PostgreSQL)

Schema (the contract):
  order_id      BIGINT       NOT NULL PRIMARY KEY
  customer_id   BIGINT       NOT NULL
  restaurant_id INTEGER      NOT NULL
  order_amount  DECIMAL(10,2) NOT NULL
  status        VARCHAR(20)  NOT NULL CHECK (status IN ('placed','confirmed','delivered','cancelled'))
  created_at    TIMESTAMP    NOT NULL DEFAULT NOW()

Row 1: [9284751, 4201938, 7823, 380.00, 'delivered', '2026-03-17 20:14:32']
Row 2: [9284752, 1092847, 2341, 220.00, 'cancelled', '2026-03-17 20:15:01']
Row 3: [9284753, 8374621, 7823, 540.00, 'confirmed', '2026-03-17 20:15:12']

Properties:
  ✓ Every row has exactly the same columns
  ✓ Each column has a fixed data type enforced by the database
  ✓ Invalid data is REJECTED — try inserting order_amount='abc' and it fails
  ✓ NULL constraints prevent missing required fields
  ✓ CHECK constraints enforce business rules (valid status values)

This predictability is what makes SQL queries fast and reliable.
A query engine knows exactly where each field is without parsing.

Where structured data lives

Structured data primarily lives in relational databases — PostgreSQL, MySQL, SQL Server, Oracle — and in structured file formats like CSV (Comma-Separated Values) and TSV (Tab-Separated Values). Columnar formats like Parquet and ORC are also structured — they have a defined schema — but they store data in a columnar layout optimised for analytics rather than the row-oriented layout of operational databases.

What structured data demands from your pipeline

Schema change handling

When a developer adds a new column to a production database, your ingestion pipeline must handle it. If you are doing full-load ingestion, the new column appears automatically. If you are doing incremental ingestion by querying specific columns, you miss it until you update the query. Schema changes are the most common cause of pipeline failures in structured data ingestion. Always have a monitoring layer that detects when source schema differs from your expected schema.

Type fidelity

The data type in the source must be preserved correctly in the destination. A DECIMAL(10,2) amount column in PostgreSQL must not be loaded as FLOAT in the warehouse — floating point precision loss in financial data is a serious bug. A TIMESTAMP WITH TIME ZONE column must carry timezone information through the pipeline, not be silently converted to a naive datetime. Every type conversion in a pipeline is a potential data quality issue.

NULL semantics

NULL in a relational database has specific semantics: the value is unknown, not absent, not zero, not empty string. Different systems handle NULLs differently. A NULL in PostgreSQL, stored as CSV (where it becomes an empty field), loaded into Snowflake can behave differently in WHERE clauses and aggregations. Test your NULL handling explicitly — do not assume it passes through correctly.

Incremental extraction

Structured databases are usually OLTP systems under active transactional load. Extracting all data every run (full load) is acceptable for small tables but becomes too slow and too resource-intensive as tables grow. Incremental extraction — pulling only rows created or modified since the last run — requires a reliable high-watermark column (usually a timestamp or auto-incrementing ID). Not all tables have one, which forces you to use CDC instead.

The CSV trap — structured format, unstructured reality

CSV files look like structured data — they have rows and columns, often with a header row that names the columns. But CSV has no schema enforcement. Nothing stops a row from having more or fewer columns than the header. Nothing enforces data types — everything is text. Nothing prevents nulls from appearing in any field. A data engineer who treats CSV as reliably structured has not yet encountered a CSV file from a vendor, a finance team, or an ERP export.

CSV — what looks structured but is not enforced

"correct" CSV file:
order_id,customer_id,amount,status
9284751,4201938,380.00,delivered
9284752,1092847,220.00,cancelled

Reality of vendor-supplied CSV files:
  Row with wrong column count:
    9284753,8374621,540.00,confirmed,extra_column_nobody_warned_you_about

  Row with wrong data type:
    9284754,4201938,N/A,delivered    ← amount is text, not decimal

  Row with ambiguous null:
    9284755,,380.00,delivered         ← customer_id is empty — null or 0?

  Row with embedded delimiter:
    9284756,4201938,"1,380.00",delivered  ← amount contains a comma
                                            needs quoting — did the exporter handle it?

  Row with encoding issue:
    9284757,4201938,380.00,délivré     ← accented character — what encoding?

  Header row missing entirely:
    9284758,4201938,540.00,confirmed   ← is row 1 data or headers?

Pipeline rule: always validate CSV structure before processing.
Never assume a CSV is well-formed because it came from a "reliable" source.

// Part 03 — Category Two

🔀

Semi-Structured Data

Semi-structured data has some organisation — fields have names, values have types — but the structure is not enforced by an external schema. The data carries its structure within itself. This "self-describing" property is both its greatest strength and its greatest source of engineering complexity.

The strength: semi-structured formats can represent hierarchical, nested, and variable-length data that relational tables cannot. A single JSON object can represent an order with a nested array of items, a nested customer object, and optional fields that only appear for some order types — all in one document without needing six related tables.

The complexity: because the structure is self-declared rather than enforced, it can change at any time. A field that was always present in past records can be absent in new ones. A field that was always a string becomes an array. A new nested object appears. None of these changes cause an error at the source — but all of them can silently break your downstream pipeline.

JSON — the dominant semi-structured format

JSON (JavaScript Object Notation) is the most common data format a data engineer encounters. Every REST API returns JSON. Every modern event stream carries JSON payloads. Every NoSQL database stores JSON documents. Understanding JSON deeply is not optional.

JSON — structure, nesting, and the engineering challenges it creates

A single DoorDash order as JSON — what an API actually returns:

{
  "order_id": 9284751,
  "created_at": "2026-03-17T20:14:32+05:30",
  "customer": {
    "id": 4201938,
    "name": "Priya Sharma",
    "phone": "+91-9876543210",
    "address": {
      "flat": "4B",
      "building": "Prestige Meridian",
      "area": "Koramangala",
      "city": "Seattle",
      "zip_code": "560034",
      "lat": 12.9352,
      "lng": 77.6245
    }
  },
  "restaurant": {
    "id": 7823,
    "name": "Punjabi Dhaba",
    "city": "Seattle"
  },
  "items": [
    {
      "id": "MI-001",
      "name": "Butter Chicken",
      "quantity": 1,
      "unit_price": 320.00,
      "customisations": ["extra gravy"]
    },
    {
      "id": "MI-002",
      "name": "Garlic Naan",
      "quantity": 2,
      "unit_price": 30.00,
      "customisations": []
    }
  ],
  "payment": {
    "method": "UPI",
    "upi_id": "priya@ybl",
    "amount": 380.00,
    "status": "captured"
  },
  "delivery": {
    "agent_id": 83921,
    "estimated_time_mins": 35,
    "actual_time_mins": 42
  },
  "promo_code": null
}

Engineering challenges this single JSON creates:

1. NESTING — customer.address.city is 3 levels deep.
   To get city into a flat table you must flatten the hierarchy.
   SQL: customer['address']['city'] or JSON_EXTRACT()

2. ARRAYS — items is a list of variable length.
   1 order = 1 JSON object but potentially many items rows.
   Must EXPLODE/UNNEST the array to get one row per item.

3. OPTIONAL FIELDS — promo_code is null here.
   Other orders might not have the promo_code key at all.
   Code that does order['promo_code'] crashes on those.
   Code that does order.get('promo_code') handles it.

4. TYPE INCONSISTENCY — actual_time_mins might be null
   for orders still in progress. Your pipeline must handle
   both integer and null for the same field.

5. TIMESTAMP FORMAT — "2026-03-17T20:14:32+05:30"
   is ISO 8601 with timezone. Some records might have
   "2026-03-17 20:14:32" (no timezone). Parse explicitly.

The flattening problem — from nested JSON to flat tables

Relational databases and data warehouses store data in flat tables — rows and columns with no nesting. JSON is hierarchical. The process of converting nested JSON into flat table rows is called flattening ornormalisation, and it is one of the most common transformation tasks in data engineering.

Flattening a nested JSON order into relational tables

Input: 1 JSON order object (as shown above)

Output: multiple flat table rows

Table: orders (one row per order)
  order_id | created_at          | customer_id | restaurant_id | total_amount | payment_method | promo_code
  9284751  | 2026-03-17 20:14:32 | 4201938     | 7823          | 380.00       | UPI            | NULL

Table: order_items (one row per item — EXPLODED from items array)
  order_id | item_id | item_name      | quantity | unit_price | customisations
  9284751  | MI-001  | Butter Chicken | 1        | 320.00     | extra gravy
  9284751  | MI-002  | Garlic Naan    | 2        | 30.00      | (empty)

Table: customers (one row per customer — EXTRACTED from nested customer object)
  customer_id | name          | phone           | city      | zip_code | lat     | lng
  4201938     | Priya Sharma  | +91-9876543210  | Seattle | 560034  | 12.9352 | 77.6245

Table: deliveries (one row per delivery)
  order_id | agent_id | estimated_mins | actual_mins
  9284751  | 83921    | 35             | 42

Key decision: how deep to flatten?
  Deep flatten: extract every nested field into its own column
    → Easy for analysts to query, but breaks when nested structure changes
  Shallow flatten: keep some nesting as JSON column in warehouse
    → More resilient to schema changes, but requires JSON functions to query
  
Most teams use a hybrid: flatten the top-level fields you know you need,
keep rarely-needed deep nesting as a JSON column for flexibility.

XML — the older semi-structured format

XML (Extensible Markup Language) predates JSON and is more verbose — the same data takes roughly 3–4× more bytes to represent. But XML is still heavily used in enterprise systems, government data exchanges, healthcare (HL7 FHIR), and financial data (SWIFT, FIX protocol). If you work with legacy enterprise clients or government data sources, you will encounter XML regularly.

The same order data in XML — more verbose, same engineering challenges

<?xml version="1.0" encoding="UTF-8"?>
<order id="9284751">
  <created_at>2026-03-17T20:14:32+05:30</created_at>
  <customer id="4201938">
    <name>Priya Sharma</name>
    <address>
      <city>Seattle</city>
      <zip_code>560034</zip_code>
    </address>
  </customer>
  <items>
    <item id="MI-001">
      <name>Butter Chicken</name>
      <quantity>1</quantity>
      <unit_price>320.00</unit_price>
    </item>
    <item id="MI-002">
      <name>Garlic Naan</name>
      <quantity>2</quantity>
      <unit_price>30.00</unit_price>
    </item>
  </items>
  <payment method="UPI" amount="380.00" status="captured"/>
</order>

XML-specific engineering challenges:
  → Namespaces: <ns0:order xmlns:ns0="http://schema.example.com/orders">
    Different XML documents use namespace prefixes differently.
    Always strip or normalise namespaces before parsing.

  → Attributes vs elements: both <amount>380</amount> and
    <payment amount="380"/> are valid XML for the same data.
    Your parser must handle both.

  → CDATA sections: text content that may contain special characters
    is wrapped in <![CDATA[...]]> — must be unwrapped before use.

  → Large XML files cannot be loaded into memory entirely.
    Use streaming parsers (SAX in Python) for files > a few MB.

Log files — semi-structured but barely

Application log files are technically semi-structured — each line has a timestamp, a severity, and a message — but the structure is inconsistently applied and the message content is free text. Log files are the most challenging form of semi-structured data because the "schema" is defined by developers writing log statements, not by a formal specification.

Log file — semi-structured but requiring careful parsing

Raw application log (what you receive):
2026-03-17 20:14:32.847 INFO  [OrderService] Order 9284751 placed by customer 4201938
2026-03-17 20:14:33.012 INFO  [PaymentService] Payment captured: ₹380.00 via UPI
2026-03-17 20:14:33.241 DEBUG [RestaurantService] Notifying restaurant 7823
2026-03-17 20:15:02.119 WARN  [OrderService] Delivery estimate exceeding threshold: order 9284751
2026-03-17 20:16:41.003 ERROR [PaymentService] Refund failed for order 9284754: timeout after 3s

Parsing this requires:
  1. Regex to extract timestamp, level, service, message
  2. Further parsing of message field to extract IDs and values
  3. Handling log lines that span multiple lines (stack traces)
  4. Handling encoding variations (₹ symbol)
  5. Dealing with log rotation and multiple log files per hour

Pattern:
  ^(d{4}-d{2}-d{2} d{2}:d{2}:d{2}.d{3})s+
  (INFO|DEBUG|WARN|ERROR)s+
  [([^]]+)]s+(.+)$

Structured log (what modern applications emit — much easier):
  {"timestamp":"2026-03-17T20:14:32.847Z","level":"INFO",
   "service":"OrderService","event":"order_placed",
   "order_id":9284751,"customer_id":4201938}

If you have influence over how your company's services log —
always push for structured JSON logging. It eliminates regex.

What semi-structured data demands from your pipeline

Schema inference and evolution

Semi-structured data has no enforced schema. When a new field is added at the source, your pipeline should ideally detect and handle it automatically, not crash. Use schema inference tools (Spark's inferSchema, BigQuery's autodetect) for flexibility, but validate inferred schemas against expectations. When a field type changes — a price that was a string becomes a number — your pipeline must handle both for a transition period.

Explosion of arrays

When a JSON document contains an array (like the items array in an order), you often need one database row per array element, not one row per document. This explosion is done with UNNEST (SQL), explode() (PySpark), or equivalent functions. Always check whether the explosion is appropriate for the use case — sometimes you want to keep arrays as a JSON column rather than exploding into many rows.

Handling missing fields gracefully

In a JSON document, a field can be missing entirely (different from null). Your code must distinguish between field_not_present and field_is_null — they mean different things. Use .get() with a default in Python rather than direct key access. In SQL, use JSON_EXTRACT_SCALAR with a NULL default rather than assuming the key exists.

Depth decisions — flatten or preserve

Not every nested field needs to be flattened into its own column. Flatten the fields your analysts query most frequently. Keep rarely-accessed deep nesting as a JSON column that analysts can query with JSON functions when needed. This balance between accessibility and schema flexibility is one of the most important architectural decisions in semi-structured data pipelines.

// Part 04 — Category Three

🌊

Unstructured Data

Unstructured data has no predefined schema and no inherent organisation that a machine can automatically interpret. It is raw content — the bytes of an image, the waveform of an audio file, the free text of a customer review, the pages of a PDF contract. 80% of all data generated in the world is unstructured.

For most of data engineering history, unstructured data was stored but largely ignored — too difficult to process at scale without the ML tools to interpret it. That has changed dramatically in the past three years. Every major Indian tech company now has projects that extract structured signals from unstructured data: sentiment from customer reviews, fraud signals from transaction descriptions, product categories from listing images, insights from call recordings.

Types of unstructured data and where they appear

Images

Product photos (Amazon), KYC documents (fintechs), food photos (Uber Eats), satellite imagery (agricultural analytics)

Store as-is in object storage. Extract metadata (dimensions, format, size). ML models extract structure (product category, face detection, OCR for document text).

Audio and Video

Customer service call recordings, video KYC sessions, platform content (YouTube, streaming)

Store compressed in object storage. Extract duration, format, bitrate as metadata. Speech-to-text models produce structured transcripts for analysis.

Free Text

Customer reviews, support tickets, social media comments, email content, chatbot conversations

Store raw text. NLP models extract sentiment, topic, intent, named entities. Output is structured (sentiment: positive, topic: delivery, entity: [Koramangala]).

PDFs and Documents

Contracts, invoices, bank statements, regulatory filings, research papers

Extract text with PDF parsers (pdfplumber, PyMuPDF). For scanned PDFs, OCR first. Structure extraction identifies tables, headers, and key fields.

Binary Files

Proprietary formats, CAD files, medical imaging (DICOM), sensor binary streams

Requires format-specific parsers. Store raw, extract what metadata is possible (file size, creation date, format version).

Emails

Customer communications, vendor invoices, internal business correspondence

Parse headers (from, to, date, subject) as structured fields. Body and attachments remain unstructured. NLP for classification and entity extraction.

How data engineers handle unstructured data

A data engineer's relationship with unstructured data has three distinct responsibilities, each requiring different skills and tools.

The three DE responsibilities with unstructured data

RESPONSIBILITY 1: Store and Organise
  Unstructured data must be stored in object storage (S3, ADLS, GCS)
  with a clear organisation scheme so it can be found and processed.

  Bad:  s3://company-data/uploads/file1.jpg
  Good: s3://company-data/raw/product-images/category=electronics/
          sku=SKU-00283741/2026-03-17/image_001.jpg

  Good organisation includes:
    → Meaningful key prefix (not random UUIDs)
    → Partition by date for time-range access
    → Partition by category/type for selective processing
    → Consistent naming convention

RESPONSIBILITY 2: Extract Metadata
  Even before ML processing, extract the metadata that IS structured:

  For images:    file_size, format (JPEG/PNG), dimensions (width×height),
                 creation_date, source_system, associated_entity_id
  For audio:     file_size, format (MP3/WAV), duration_seconds, sample_rate,
                 call_id, agent_id, customer_id, call_timestamp
  For documents: file_size, page_count, creation_date, author (if available),
                 document_type, associated_case_id

  This metadata is structured and immediately useful for filtering,
  capacity planning, and processing queue management.

RESPONSIBILITY 3: Build Processing Pipelines
  Orchestrate the ML models or transformation tools that convert
  unstructured content into structured signals.

  Image pipeline:
    Raw image → resize/normalise → ML model → structured output
    {sku: "SKU-283741", category: "electronics", has_watermark: false,
     background: "white", product_visible: true, quality_score: 0.94}

  Customer review pipeline:
    Raw text → clean/normalise → NLP model → structured output
    {review_id: 8734, sentiment: "negative", score: 0.23,
     topics: ["delivery", "packaging"], entities: ["Seattle"],
     urgency: "high", requires_response: true}

  Call recording pipeline:
    Audio file → speech-to-text → transcript → NLP → structured output
    {call_id: "CALL-83921", duration_secs: 247, transcript: "...",
     sentiment_overall: "frustrated", resolution: "refund_initiated",
     agent_empathy_score: 0.78}

The data lakehouse for unstructured data

Unstructured data cannot live in a data warehouse — warehouses store tables, not images or audio files. Object storage (S3, ADLS, GCS) is the only viable storage at scale for unstructured data. The data engineer's job is to build a metadata layer that makes the unstructured data discoverable and processable alongside the structured and semi-structured data.

The practical pattern: store raw unstructured data in object storage, extract metadata into a structured table, and store ML model outputs as structured data in the warehouse. The warehouse contains the derived structured signals; the raw unstructured data lives in the lake. Analysts query the warehouse; ML engineers query the lake directly.

// Part 05 — Direct Comparison

All Three Categories — The Complete Comparison

Dimension	Structured	Semi-Structured	Unstructured
Schema	Fixed, enforced by external system	Self-describing, flexible — carried within the data	None — no inherent organisation
Examples	SQL tables, CSV with headers, Excel	JSON, XML, Avro, Parquet, log files	Images, audio, video, free text, PDFs, emails
Primary storage	Relational DB, Data Warehouse	Object store (S3/ADLS) + document store	Object store only (S3/ADLS/GCS)
Query method	SQL	SQL + JSON functions	ML models only — cannot SQL-query raw pixels
Schema change	Hard — breaks pipelines	Flexible — new fields just appear	N/A — no schema to change
Best for	Transactional data, reporting, analytics	Events, APIs, configs, logs, hierarchical data	Content, media, documents, sensor raw streams
Transformation	SQL CTEs, dbt models	Flatten, explode, parse, normalise	OCR, speech-to-text, image classification, NLP, embedding
Warehouse ready?	Yes — native table storage	Needs flattening first	No — store outputs of ML processing
Compression	Moderate (columnar storage 5–10×)	Very high (repetitive keys compress well)	Format-dependent (images: 10–20×, audio: 5–10×)
Pipeline failures	Schema drift, type mismatches, NULL propagation	Missing fields, type changes, array explosion	Model drift, storage costs, processing time

// Part 06 — Storage Decisions

Storage Implications — Which Category Goes Where

The category of data determines which storage system is appropriate. Choosing the wrong storage for a data category does not cause an immediate error — you can technically store anything anywhere. But the wrong choice creates query performance problems, cost inefficiencies, and maintenance nightmares that compound over months and years.

Relational Databases

Best for

Structured data for operational workloads

Also works

Small semi-structured data (PostgreSQL supports JSONB columns)

Avoid for

Large semi-structured data, unstructured data, analytics at scale

Relational databases are optimised for fast individual record access with ACID transactions. They enforce schema. They handle thousands of concurrent writes. They are not designed for scanning millions of rows for analytics or storing large binary files.

Data Warehouses (Snowflake, BigQuery, Redshift)

Best for

Structured data for analytics, flattened semi-structured data

Also works

JSON/semi-structured data stored in VARIANT/JSON columns with JSON functions

Avoid for

Raw unstructured data (images, audio), extremely high-cardinality semi-structured with frequent schema changes

Warehouses are columnar SQL engines. They excel at scanning millions of structured rows. Modern warehouses (Snowflake VARIANT, BigQuery JSON) handle semi-structured data well with JSON-aware functions. They cannot store or query binary files meaningfully.

Object Storage (S3, ADLS, GCS)

Best for

All three categories at scale — the universal storage layer

Also works

Everything from raw CSV to Parquet to images to audio to video

Avoid for

Low-latency random access (millisecond response queries)

Object storage is cheap, infinite, and accepts any format. It is the foundation of the modern data lake. With table formats (Delta Lake, Iceberg) on top, it handles structured and semi-structured data with ACID semantics. For unstructured data it is the only viable option at scale.

Document Databases (MongoDB, Firestore, DynamoDB)

Best for

Semi-structured data for operational workloads with flexible schemas

Also works

Product catalogues (variable attributes per product), user profiles, CMS content

Avoid for

Analytics and reporting (no native SQL aggregations), strongly relational data, large files

Document databases store JSON documents natively. Each document can have a different structure. Fast for reading and writing individual documents. Not designed for cross-document aggregations that analytics requires. Often serve as source systems that data engineers ingest from.

// Part 07 — Real World Examples

How Indian Companies Handle All Three Types

Every major Indian tech company deals with all three data categories simultaneously. Here is what that looks like in practice at three different types of organisations.

Amazon — E-commerce

Structured

Orders table: order_id, customer_id, product_id, quantity, price, status, created_at

Inventory table: sku, warehouse_id, quantity_available, reorder_threshold

Pricing table: sku, price, discount_pct, valid_from, valid_to

Seller settlements: seller_id, settlement_amount, bank_account, settlement_date

Semi-Structured

Product catalogue: JSON documents with variable attributes per category (electronics have RAM/storage, clothing has size/color/fabric)

User behaviour events: JSON events from the app (page_viewed, product_clicked, add_to_cart, search_query)

Seller API responses: JSON from third-party seller systems with inconsistent schemas

Search logs: semi-structured text with query, results, and click data

Unstructured

Product images: 100M+ SKU images stored in S3, classified by ML models for search ranking

Customer review text: free-text reviews, NLP models extract sentiment and topic

Seller onboarding documents: PDFs of GST certificates, PAN cards — OCR extracts structured fields

Venmo — Fintech

Structured

Transaction table: txn_id, sender_id, receiver_id, amount, type, status, timestamp

UPI mandate table: mandate_id, customer_id, merchant_id, amount_limit, frequency

Settlement table: settlement_id, bank_id, amount, settlement_time, status

Fraud labels: txn_id, is_fraud, fraud_type, reviewed_by, review_timestamp

Semi-Structured

Transaction metadata: JSON blob with payment rail details, device fingerprint, session context

Bank API responses: JSON responses from 200+ bank APIs, each with slightly different schema

Push notification events: JSON events tracking delivery, opens, clicks per notification

Compliance logs: structured + semi-structured audit trail of every system action

Unstructured

Video KYC recordings: stored in S3, ML models verify liveness and identity document authenticity

Customer support call recordings: speech-to-text, then NLP for intent classification and resolution tagging

Dispute document uploads: images and PDFs of transaction receipts for chargeback handling

Apollo Hospitals — Healthcare

Structured

Patient table: patient_id, name, DOB, blood_group, allergies, primary_doctor_id

Appointment table: appointment_id, patient_id, doctor_id, scheduled_at, status

Billing table: bill_id, patient_id, total_amount, insurance_claimed, paid_amount

Lab results table: result_id, patient_id, test_type, value, unit, reference_range, abnormal_flag

Semi-Structured

EHR (Electronic Health Records): HL7 FHIR format — XML/JSON documents with nested clinical observations

Prescription data: semi-structured drug names, dosages, and instructions — inconsistent across doctors

Medical device outputs: JSON streams from monitoring equipment (ECG, blood pressure monitors)

Unstructured

X-rays and MRI scans: DICOM format images, AI models assist in radiology interpretation

Doctor notes: free-text clinical notes, NLP extracts diagnoses, medications, and procedures for coding

Pathology report PDFs: scanned and digital, OCR extracts test results into structured records

// Part 08 — Real World

💼 What This Looks Like at Work

Designing a Pipeline That Handles All Three Types at Once

Scenario — Insurance Tech Startup · New Data Pipeline Task

Your manager gives you a task: "Build a pipeline that ingests all data from our new claim processing workflow and makes it available for analysis. The workflow generates a claim record in PostgreSQL, a JSON event payload from the mobile app, and a scanned PDF of the supporting document."

One business process, three data types. Here is how you approach it.

Step 1 — Classify each source: The PostgreSQL claim record is structured — fixed schema, enforced types, reliable. The JSON event from the app is semi-structured — self-describing but potentially inconsistent across app versions. The scanned PDF is unstructured — binary content requiring OCR before any fields can be extracted.

Step 2 — Design separate ingestion paths for each type:The PostgreSQL record goes through standard incremental SQL extraction into the Bronze layer as Parquet. The JSON event goes through a Kafka consumer that reads the event stream and lands raw JSON files in the Bronze layer. The PDF goes through an S3 PUT event trigger that fires an OCR pipeline when a new document is uploaded.

Step 3 — Design converging transformation: In the Silver layer, all three inputs transform into structured tables. The PostgreSQL data cleans directly into a claims_silver table. The JSON event is flattened — claim_id, device_type, app_version, GPS coordinates, timestamp — into a claim_events_silver table. The OCR output extracts document_type, issue_date, insured_name, and document_verified flag into a claim_documents_silver table.

Step 4 — Join in Gold: A single Gold table joins all three Silver tables on claim_id, giving analysts one row per claim with all structured fields from all three sources. The raw PDF is linked via a URL column pointing to S3 for auditors who need to view the original document.

The lesson: A data engineer who only knows one data type is blocked at step one. A data engineer who understands all three can design the architecture, identify the tools needed for each path, and deliver a unified output that hides the complexity from the analyst who just sees a clean SQL table.

// Part 09 — Interview Prep

5 Interview Questions — With Complete Answers

Q1. What is the difference between structured and semi-structured data? Give an example of each from a real application.

Structured data has a fixed, predefined schema enforced by an external system. Every record has the same fields in the same positions with the same types. Nothing can be stored that violates the schema — the system rejects it. A relational database table is the canonical example: a PostgreSQL orders table defines exactly which columns exist and their types. Every row must conform. Semi-structured data is self-describing — the structure is carried within the data itself rather than enforced externally. Fields can be added, removed, or changed without a schema migration. The data is organised but not uniformly so. A JSON event payload from a mobile app is a clear example: the app might include a promo_code field when a promo was applied and omit it entirely when there was none. The JSON still parses correctly in both cases. From a Stripe context: the transaction table in their operational database is structured — txn_id, sender_id, amount, status, timestamp are fixed columns enforced by the database. The transaction_metadata field stored alongside it is semi-structured — a JSON blob containing the payment rail details, device fingerprint, and UPI reference that varies by transaction type and changes as new payment methods are added. Both coexist in the same system serving different needs.

Q2. You receive a JSON file from a vendor where some records have a nested address object and some records have the address fields at the top level. How do you handle this?

This is a schema inconsistency problem — two different record layouts in the same dataset — and it needs to be handled explicitly at the ingestion or transformation stage rather than assumed away. My first step is to understand the scope: how many records have each format? If 95% have one format and 5% have another, the minority format is likely a legacy format from an older API version. I would confirm with the vendor whether the inconsistency is intentional (transitioning between formats) or a bug on their end. For the transformation logic, I write code that checks which format each record uses and normalises both to the same structure. In Python: city = record.get('address', {}).get('city') or record.get('city') This uses the nested path first, falls back to the flat path if the nested structure is absent. In SQL (Snowflake): COALESCE(JSON_EXTRACT_SCALAR(payload, '$.address.city'), JSON_EXTRACT_SCALAR(payload, '$.city')) I also add a validation check that flags records where neither path resolves to a non-null value — those are records with genuinely missing city data that need to be investigated separately. Long term, I document this dual-format handling and include it in the data contract with the vendor, requesting they standardise to one format. The normalisation logic becomes a maintenance liability if it persists indefinitely.

Q3. How would you approach building a pipeline to process customer reviews (free text) and make the sentiment available for analysis?

This is an unstructured-to-structured pipeline — raw text as input, structured sentiment scores as output. I would design it in three stages. Stage one is ingestion and storage. Customer reviews come from the product database as raw text strings. I ingest them into the Bronze layer as-is, along with the structured metadata I already have: review_id, product_id, customer_id, rating (1–5 stars), timestamp, and language. The raw text is stored alongside its structured context. Stage two is processing. I build a processing pipeline that reads unreviewed records from Bronze, runs them through a sentiment model — either a pre-trained model like a fine-tuned BERT variant for product reviews, or the OpenAI API for simpler use cases — and produces structured output per review: sentiment_label (positive/neutral/negative), sentiment_score (0–1 continuous), primary_topic (delivery/product_quality/customer_service/pricing extracted by topic modelling), and language_detected. Stage three is storage in Silver and Gold. The structured model output joins with the original structured metadata in the Silver layer: one row per review with all fields. In the Gold layer, I aggregate: daily sentiment scores per product category, weekly trending topics, sentiment correlation with star rating to validate model quality, and alert thresholds for sudden sentiment drops on a product. The key engineering decisions are: how to handle the processing cost (run in batch nightly rather than in real-time unless live monitoring is needed), how to version the model so re-processing with a new model is possible, and how to handle multilingual reviews (Tamil, Hindi, Kannada alongside English) which may require different model versions.

Q4. A new column is added to a source database table without notifying the data team. How does this break your pipeline and how do you prevent it?

The impact of an unannounced schema change depends entirely on how the ingestion was designed. If the ingestion uses SELECT * to read all columns and writes everything to a schema-on-read destination like S3 Parquet or a Snowflake VARIANT column, the new column appears automatically in new data without breaking anything. This is the most resilient approach for sources that change frequently. If the ingestion uses an explicit column list — SELECT order_id, customer_id, amount, status FROM orders — the new column is silently ignored. No error, no data, just a missing field. This is the most common form of silent data loss in structured ingestion pipelines. If the ingestion writes to a strict destination schema (a relational database with defined columns) and the source schema does not match the destination, the pipeline either fails with a column mismatch error or succeeds but drops the new column depending on the tool configuration. Prevention requires two mechanisms. First, schema comparison monitoring: on each pipeline run, compare the source schema against the last known schema and alert when they differ. This can be as simple as storing the column list from the previous run and comparing it to the current run. Tools like Great Expectations support schema validation natively. Second, a data contract with the source team: any schema change to a table that data engineering ingests must be communicated and reviewed before deployment. This is a process control, not a technical one, but it is the most effective prevention because it catches changes before they reach production.

Q5. What is JSON flattening and when would you choose NOT to flatten a nested JSON field?

JSON flattening is the process of converting a hierarchical nested JSON structure into a flat relational table structure — extracting nested fields into their own columns so analysts can query them with simple SQL without using JSON functions. For example, a customer.address.city nested field gets flattened into a top-level city column. An items array containing multiple items gets exploded into multiple rows — one per item — joined to the parent order via order_id. The result is a structure that looks like a traditional relational table and can be queried with basic SQL. I would choose not to flatten in four situations. First, when the nested structure is deeply hierarchical and rarely queried. If a JSON payload contains a nested configuration object accessed only by one report per quarter, flattening it into 30 columns that are mostly NULL adds schema clutter without proportionate analytical value. Keep it as a JSON column, accessible to those who need it via JSON functions. Second, when the schema changes frequently. If a vendor changes their JSON structure every few weeks, maintaining a flattening transformation that tracks every change is expensive. Storing as JSON and letting analysts access fields they need with JSON_EXTRACT is more resilient. Third, when the array explosion multiplies rows in a way that distorts aggregations. If an order has 20 items and you explode into item rows, any aggregation at the order level requires de-duplication or GROUP BY. Keeping the items array as a JSON column and using UNNEST only when needed prevents this distortion in the main orders table. Fourth, when storage cost of the repeated keys is a concern. JSON stores keys with every record. At very high scale, the repeated key names in JSON use significant storage. But at most companies this is not a primary concern compared to query convenience.

// Error Library

Errors You Will Hit — And Exactly Why They Happen

KeyError: 'promo_code' — when processing JSON events from the mobile app

Cause: The code accesses a JSON field with direct key syntax: event["promo_code"]. This works for events where the key exists but raises a KeyError for events where promo_code was never included in the payload — because the order had no promo applied. Semi-structured data allows optional fields; direct key access assumes all fields are always present.

Fix: Use .get() with a default: event.get("promo_code") returns None if the key is absent instead of raising an error. For nested fields: event.get("customer", {}).get("address", {}).get("city"). In SQL: use JSON_EXTRACT_SCALAR with COALESCE to handle missing keys. Document which fields are always present (required) and which are optional in your pipeline specification.

PySpark AnalysisException: cannot resolve 'items' given input columns — after exploding a JSON array

Cause: The column name used in the explode() call does not match the actual field name in the JSON schema after parsing. This happens when JSON keys use camelCase (orderItems) but the code expects snake_case (order_items), or when the JSON structure was nested differently than expected and the array is inside another object.

Fix: Print the schema of the parsed DataFrame before writing transformation code: df.printSchema(). This shows the exact field names and nesting as Spark sees them. Match your column references exactly to this schema. If the schema varies between records, use schema inference with a sample of records and validate the result before processing the full dataset.

ValueError: could not convert string to float: 'N/A' — when processing a CSV from a vendor

Cause: The vendor uses "N/A" as their null indicator instead of an empty field. When Pandas or Spark tries to cast the amount column to float, "N/A" is not a valid float representation. This is a CSV format irregularity — the vendor did not follow the convention of leaving the field empty for null values.

Fix: Specify the null value indicator when reading the CSV: pd.read_csv("file.csv", na_values=["N/A", "NULL", "null", "-", "n/a"]). In Spark: spark.read.option("nullValue", "N/A").csv("path"). Investigate all the null representations the vendor uses (ask them, or inspect a sample of the file) and include all of them in the na_values list. Validate after loading that null percentages are within expected ranges.

Pipeline produces 3× more rows than expected after joining orders to product_catalogue

Cause: The product catalogue is stored as a JSON document store (MongoDB) where each product has multiple variant documents — size S, M, L of the same SKU each have their own document. When joined to orders on product_id, each order row matches three catalogue rows, tripling the output. This is a fan-out join caused by not understanding the cardinality of the semi-structured source.

Fix: Always check the cardinality of your join keys before writing a JOIN in a pipeline. Run SELECT product_id, COUNT(*) FROM product_catalogue GROUP BY product_id HAVING COUNT(*) > 1 to identify multi-row keys. For this case: pre-aggregate the catalogue to one row per product_id (selecting the active or default variant) before joining to orders. Or join on a more specific key (product_id + variant_id) if that data is available in the orders table.

OCR pipeline extracts garbled text from scanned PDFs — characters appear as symbols or wrong letters

Cause: The scanned PDFs have low resolution (below 150 DPI), were scanned at an angle, or contain printed fonts that the OCR engine was not trained on. OCR (Optical Character Recognition) quality degrades significantly with image quality — a 72 DPI mobile phone photo of a document produces far worse results than a 300 DPI flatbed scanner output.

Fix: Add image pre-processing before OCR: deskew the image (correct rotation), enhance contrast, upscale to 300 DPI if below threshold. Use image quality scoring to reject documents that are below a confidence threshold and route them to manual review rather than storing garbled text as if it were reliable data. Log the OCR confidence score alongside the extracted text so downstream consumers know the reliability of the data.

🎯 Key Takeaways

✓Every data source belongs to one of three categories: Structured (fixed schema, enforced), Semi-Structured (self-describing, flexible), or Unstructured (no schema, requires interpretation). Identifying the category immediately tells you which tools, storage systems, and transformation approaches to use.
✓Structured data has a rigid schema enforced externally — relational databases, CSVs with headers. It is easy to query with SQL but breaks pipelines when the schema changes. Always monitor source schema and handle changes explicitly.
✓CSV looks structured but is not enforced — nothing prevents wrong column counts, wrong types, or inconsistent null representations. Always validate CSV structure before processing. Never assume vendor CSVs are well-formed.
✓Semi-structured data is self-describing — JSON, XML, Avro, log files. Its strength is flexibility; its risk is that the structure can change at any time without warning. Use .get() with defaults in Python, COALESCE in SQL, and always validate schema against expectations.
✓JSON arrays must be explicitly exploded (one row per array element) when relational representation is needed. Always check join cardinality before exploding — unexpected fan-out multiplies rows and corrupts aggregations.
✓Unstructured data (images, audio, video, free text, PDFs) has no queryable schema. Data engineers store it in object storage, extract available metadata as structured fields, and build processing pipelines that use ML models to produce structured outputs.
✓The correct storage for each category: structured data for analytics goes to data warehouses; semi-structured and all raw data goes to object storage (S3/ADLS); operational structured data goes to relational databases; unstructured data always goes to object storage.
✓The flattening decision for nested JSON: flatten frequently-queried fields into columns, keep rarely-accessed deep nesting as JSON columns. Flatten too much and schema maintenance becomes burdensome. Flatten too little and analysts cannot query data without JSON functions.
✓Schema change monitoring is mandatory for structured and semi-structured sources. Compare source schema against the last known schema on every pipeline run. Alert when they differ. Never let a schema change silently drop data.
✓Real companies handle all three data types simultaneously. Amazon has structured order tables, semi-structured product catalogue JSON, and unstructured product images — all flowing through different pipeline paths that converge in the Gold layer. A data engineer must be fluent in all three.

What comes next

Module 08 covers every major data format — not just what each one is, but what it costs in storage and compute, and what breaks when you choose the wrong format for your workload.

Module 08 → Data Formats — CSV, JSON, Parquet, Avro, ORC

What to learn next

Schemas, Tables & Keys

Data Engineering · 12 min · +150 XP

SQL for Data Engineers

Data Engineering · 20 min · +150 XP

Data Formats

Data Engineering

Schemas, Tables & Keys

Data Engineering

Discussion

Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.

Continue with GitHub