Structured, Semi-Structured and Unstructured Data
The three categories every data engineer works with — what each demands from your pipeline.
The Three Categories — Why Every Data Engineer Must Know Them
Walk into any data engineering interview and say "I can handle all types of data", and you will be asked to prove it immediately, because the three categories of data require fundamentally different storage systems, parsing approaches, transformation techniques, and infrastructure choices. Confusing them in a pipeline causes silent bugs. Choosing the wrong storage for a category creates performance problems that cannot be fixed without rebuilding.
Every data source you will ever encounter belongs to one of three categories. The moment you identify which category a new data source belongs to, you know which tools to reach for, which ingestion approach to use, and which complications to expect. That pattern recognition is what this module builds.
Structured Data
Structured data has a fixed, predefined schema. Every record has exactly the same fields, every field has a known data type, and every value occupies a known position. This rigidity is what makes structured data easy to query, easy to validate, and easy to analyse — but also what makes it break when reality does not fit neatly into rows and columns.
What makes it "structured"
Structure means: before you read a single row of data, you know exactly what you will find. A relational database table defining orders has columns named order_id, customer_id, order_amount, and created_at. Every row has exactly these four fields. order_id is always an integer. order_amount is always a decimal. created_at is always a timestamp. The schema is the contract, and the database enforces it.
Table: orders (PostgreSQL)
Schema (the contract):
order_id BIGINT NOT NULL PRIMARY KEY
customer_id BIGINT NOT NULL
restaurant_id INTEGER NOT NULL
order_amount DECIMAL(10,2) NOT NULL
status VARCHAR(20) NOT NULL CHECK (status IN ('placed','confirmed','delivered','cancelled'))
created_at TIMESTAMP NOT NULL DEFAULT NOW()
Row 1: [9284751, 4201938, 7823, 380.00, 'delivered', '2026-03-17 20:14:32']
Row 2: [9284752, 1092847, 2341, 220.00, 'cancelled', '2026-03-17 20:15:01']
Row 3: [9284753, 8374621, 7823, 540.00, 'confirmed', '2026-03-17 20:15:12']
Properties:
✓ Every row has exactly the same columns
✓ Each column has a fixed data type enforced by the database
✓ Invalid data is REJECTED — try inserting order_amount='abc' and it fails
✓ NULL constraints prevent missing required fields
✓ CHECK constraints enforce business rules (valid status values)
This predictability is what makes SQL queries fast and reliable.
A query engine knows exactly where each field is without parsing.
Where structured data lives
Structured data primarily lives in relational databases — PostgreSQL, MySQL, SQL Server, Oracle — and in structured file formats like CSV (Comma-Separated Values) and TSV (Tab-Separated Values). Columnar formats like Parquet and ORC are also structured — they have a defined schema — but they store data in a columnar layout optimised for analytics rather than the row-oriented layout of operational databases.
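The "invalid data is REJECTED" property above can be demonstrated from Python's standard library using SQLite. This is a minimal sketch, not the PostgreSQL schema itself: the DDL is a cut-down version of the orders contract, and because SQLite's type system is looser than PostgreSQL's, the CHECK and NOT NULL constraints do the enforcing here.

```python
import sqlite3

# In-memory database; the schema is the contract
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id     INTEGER NOT NULL PRIMARY KEY,
        order_amount REAL    NOT NULL,
        status       TEXT    NOT NULL
            CHECK (status IN ('placed', 'confirmed', 'delivered', 'cancelled'))
    )
""")

conn.execute("INSERT INTO orders VALUES (9284751, 380.00, 'delivered')")  # accepted

try:
    # invalid status value violates the CHECK constraint
    conn.execute("INSERT INTO orders VALUES (9284752, 220.00, 'teleported')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

try:
    # missing required field violates NOT NULL
    conn.execute("INSERT INTO orders VALUES (9284753, NULL, 'placed')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Both bad inserts raise `IntegrityError` before any data lands in the table — the database, not your pipeline code, is the gatekeeper.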
What structured data demands from your pipeline
The CSV trap — structured format, unstructured reality
CSV files look like structured data — they have rows and columns, often with a header row that names the columns. But CSV has no schema enforcement. Nothing stops a row from having more or fewer columns than the header. Nothing enforces data types — everything is text. Nothing prevents nulls from appearing in any field. A data engineer who treats CSV as reliably structured has not yet encountered a CSV file from a vendor, a finance team, or an ERP export.
A "correct" CSV file:
order_id,customer_id,amount,status
9284751,4201938,380.00,delivered
9284752,1092847,220.00,cancelled
Reality of vendor-supplied CSV files:
Row with wrong column count:
9284753,8374621,540.00,confirmed,extra_column_nobody_warned_you_about
Row with wrong data type:
9284754,4201938,N/A,delivered ← amount is text, not decimal
Row with ambiguous null:
9284755,,380.00,delivered ← customer_id is empty — null or 0?
Row with embedded delimiter:
9284756,4201938,"1,380.00",delivered ← amount contains a comma
needs quoting — did the exporter handle it?
Row with encoding issue:
9284757,4201938,380.00,délivré ← accented character — what encoding?
Header row missing entirely:
9284758,4201938,540.00,confirmed ← is row 1 data or headers?
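Most of these failure modes can be caught by a defensive pre-validation pass before any row reaches the pipeline. A minimal sketch using Python's csv module — the expected header and the type rules are illustrative, matching the sample file above:

```python
import csv
import io

EXPECTED_HEADER = ["order_id", "customer_id", "amount", "status"]

def validate_csv(text):
    """Split a CSV payload into (good_rows, bad_rows) — never assume it is well-formed."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {rows[0] if rows else 'empty file'}")
    good, bad = [], []
    for lineno, row in enumerate(rows[1:], start=2):
        problems = []
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"expected {len(EXPECTED_HEADER)} columns, got {len(row)}")
        else:
            order_id, customer_id, amount, _status = row
            if not order_id.isdigit():
                problems.append("order_id is not an integer")
            if not customer_id.isdigit():
                problems.append("customer_id is empty or not an integer")
            try:
                float(amount.replace(",", ""))  # quoted "1,380.00" still parses
            except ValueError:
                problems.append(f"amount is not numeric: {amount!r}")
        (bad if problems else good).append((lineno, row, problems))
    return good, bad

sample = (
    "order_id,customer_id,amount,status\n"
    "9284751,4201938,380.00,delivered\n"
    "9284754,4201938,N/A,delivered\n"   # wrong data type
    "9284755,,380.00,delivered\n"       # ambiguous null
)
good, bad = validate_csv(sample)
print(len(good), len(bad))  # → 1 2
```

Quarantining the bad rows (with their line numbers and reasons) instead of crashing lets the pipeline load the clean majority while the vendor fixes the rest.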
Pipeline rule: always validate CSV structure before processing.
Never assume a CSV is well-formed because it came from a "reliable" source.
Semi-Structured Data
Semi-structured data has some organisation — fields have names, values have types — but the structure is not enforced by an external schema. The data carries its structure within itself. This "self-describing" property is both its greatest strength and its greatest source of engineering complexity.
The strength: semi-structured formats can represent hierarchical, nested, and variable-length data that relational tables cannot. A single JSON object can represent an order with a nested array of items, a nested customer object, and optional fields that only appear for some order types — all in one document without needing six related tables.
The complexity: because the structure is self-declared rather than enforced, it can change at any time. A field that was always present in past records can be absent in new ones. A field that was always a string becomes an array. A new nested object appears. None of these changes cause an error at the source — but all of them can silently break your downstream pipeline.
JSON — the dominant semi-structured format
JSON (JavaScript Object Notation) is the most common data format a data engineer encounters. Most REST APIs return JSON, most modern event streams carry JSON payloads, and document databases such as MongoDB store JSON-like documents. Understanding JSON deeply is not optional.
A single Swiggy order as JSON — what an API actually returns:
{
"order_id": 9284751,
"created_at": "2026-03-17T20:14:32+05:30",
"customer": {
"id": 4201938,
"name": "Priya Sharma",
"phone": "+91-9876543210",
"address": {
"flat": "4B",
"building": "Prestige Meridian",
"area": "Koramangala",
"city": "Bangalore",
"pincode": "560034",
"lat": 12.9352,
"lng": 77.6245
}
},
"restaurant": {
"id": 7823,
"name": "Punjabi Dhaba",
"city": "Bangalore"
},
"items": [
{
"id": "MI-001",
"name": "Butter Chicken",
"quantity": 1,
"unit_price": 320.00,
"customisations": ["extra gravy"]
},
{
"id": "MI-002",
"name": "Garlic Naan",
"quantity": 2,
"unit_price": 30.00,
"customisations": []
}
],
"payment": {
"method": "UPI",
"upi_id": "priya@ybl",
"amount": 380.00,
"status": "captured"
},
"delivery": {
"agent_id": 83921,
"estimated_time_mins": 35,
"actual_time_mins": 42
},
"promo_code": null
}
Engineering challenges this single JSON creates:
1. NESTING — customer.address.city is 3 levels deep.
To get city into a flat table you must flatten the hierarchy.
SQL: customer['address']['city'] or JSON_EXTRACT()
2. ARRAYS — items is a list of variable length.
1 order = 1 JSON object but potentially many items rows.
Must EXPLODE/UNNEST the array to get one row per item.
3. OPTIONAL FIELDS — promo_code is null here.
Other orders might not have the promo_code key at all.
Code that does order['promo_code'] crashes on those.
Code that does order.get('promo_code') handles it.
4. TYPE INCONSISTENCY — actual_time_mins might be null
for orders still in progress. Your pipeline must handle
both integer and null for the same field.
5. TIMESTAMP FORMAT — "2026-03-17T20:14:32+05:30"
is ISO 8601 with timezone. Some records might have
"2026-03-17 20:14:32" (no timezone). Parse explicitly.
The flattening problem — from nested JSON to flat tables
Relational databases and data warehouses store data in flat tables — rows and columns with no nesting. JSON is hierarchical. The process of converting nested JSON into flat table rows is called flattening or normalisation, and it is one of the most common transformation tasks in data engineering.
Input: 1 JSON order object (as shown above)
Output: multiple flat table rows
Table: orders (one row per order)
order_id | created_at | customer_id | restaurant_id | total_amount | payment_method | promo_code
9284751 | 2026-03-17 20:14:32 | 4201938 | 7823 | 380.00 | UPI | NULL
Table: order_items (one row per item — EXPLODED from items array)
order_id | item_id | item_name | quantity | unit_price | customisations
9284751 | MI-001 | Butter Chicken | 1 | 320.00 | extra gravy
9284751 | MI-002 | Garlic Naan | 2 | 30.00 | (empty)
Table: customers (one row per customer — EXTRACTED from nested customer object)
customer_id | name | phone | city | pincode | lat | lng
4201938 | Priya Sharma | +91-9876543210 | Bangalore | 560034 | 12.9352 | 77.6245
Table: deliveries (one row per delivery)
order_id | agent_id | estimated_mins | actual_mins
9284751 | 83921 | 35 | 42
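The orders and order_items rows above can be produced with a short Python function. This is a sketch: the field paths follow the JSON example, `.get()` with a default survives records where an optional key is absent entirely, and the list comprehension is the explode step — one output row per element of the items array.

```python
def flatten_order(order):
    """Flatten one nested order dict into an orders row plus exploded order_items rows."""
    orders_row = {
        "order_id": order["order_id"],
        "created_at": order["created_at"],
        "customer_id": order["customer"]["id"],
        "restaurant_id": order["restaurant"]["id"],
        "total_amount": order["payment"]["amount"],
        "payment_method": order["payment"]["method"],
        "promo_code": order.get("promo_code"),  # safe when the key is missing
    }
    # EXPLODE: one row per array element, carrying order_id as the join key
    item_rows = [
        {
            "order_id": order["order_id"],
            "item_id": item["id"],
            "item_name": item["name"],
            "quantity": item["quantity"],
            "unit_price": item["unit_price"],
            "customisations": ",".join(item.get("customisations", [])),
        }
        for item in order.get("items", [])
    ]
    return orders_row, item_rows

order = {
    "order_id": 9284751,
    "created_at": "2026-03-17T20:14:32+05:30",
    "customer": {"id": 4201938},
    "restaurant": {"id": 7823},
    "items": [
        {"id": "MI-001", "name": "Butter Chicken", "quantity": 1,
         "unit_price": 320.00, "customisations": ["extra gravy"]},
        {"id": "MI-002", "name": "Garlic Naan", "quantity": 2,
         "unit_price": 30.00, "customisations": []},
    ],
    "payment": {"method": "UPI", "amount": 380.00},
    "promo_code": None,
}
orders_row, item_rows = flatten_order(order)
print(orders_row["customer_id"], len(item_rows))  # → 4201938 2
```

Note the fan-out: one order produced one orders row but two order_items rows — exactly the cardinality change that corrupts aggregations if you join and sum carelessly.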
Key decision: how deep to flatten?
Deep flatten: extract every nested field into its own column
→ Easy for analysts to query, but breaks when nested structure changes
Shallow flatten: keep some nesting as JSON column in warehouse
→ More resilient to schema changes, but requires JSON functions to query
Most teams use a hybrid: flatten the top-level fields you know you need,
keep rarely-needed deep nesting as a JSON column for flexibility.
XML — the older semi-structured format
XML (Extensible Markup Language) predates JSON and is more verbose — the same data takes roughly 3–4× more bytes to represent. But XML is still heavily used in enterprise systems, government data exchanges, healthcare (HL7 CDA, and the XML serialisation of FHIR), and financial messaging (SWIFT MX / ISO 20022, FIXML). If you work with legacy enterprise clients or government data sources, you will encounter XML regularly.
<?xml version="1.0" encoding="UTF-8"?>
<order id="9284751">
<created_at>2026-03-17T20:14:32+05:30</created_at>
<customer id="4201938">
<name>Priya Sharma</name>
<address>
<city>Bangalore</city>
<pincode>560034</pincode>
</address>
</customer>
<items>
<item id="MI-001">
<name>Butter Chicken</name>
<quantity>1</quantity>
<unit_price>320.00</unit_price>
</item>
<item id="MI-002">
<name>Garlic Naan</name>
<quantity>2</quantity>
<unit_price>30.00</unit_price>
</item>
</items>
<payment method="UPI" amount="380.00" status="captured"/>
</order>
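Parsing a document like this with Python's standard-library ElementTree might look like the sketch below (a trimmed version of the order above; note how data stored as attributes and data stored as element text need different access patterns — the attributes-vs-elements problem in practice):

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<order id="9284751">
  <customer id="4201938"><name>Priya Sharma</name></customer>
  <items>
    <item id="MI-001"><name>Butter Chicken</name><quantity>1</quantity></item>
    <item id="MI-002"><name>Garlic Naan</name><quantity>2</quantity></item>
  </items>
  <payment method="UPI" amount="380.00" status="captured"/>
</order>"""

root = ET.fromstring(xml_text)
order_id = int(root.attrib["id"])                    # value stored as an ATTRIBUTE
customer = root.find("customer/name").text           # value stored as ELEMENT text
amount = float(root.find("payment").attrib["amount"])
items = [
    {"id": it.attrib["id"],
     "name": it.find("name").text,
     "quantity": int(it.find("quantity").text)}
    for it in root.findall("items/item")             # explode, as with JSON arrays
]
print(order_id, customer, amount, len(items))  # → 9284751 Priya Sharma 380.0 2
```

Everything comes back as text — the `int()` and `float()` casts are your responsibility, unlike in JSON where numbers are typed.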
XML-specific engineering challenges:
→ Namespaces: <ns0:order xmlns:ns0="http://schema.example.com/orders">
Different XML documents use namespace prefixes differently.
Always strip or normalise namespaces before parsing.
→ Attributes vs elements: both <amount>380</amount> and
<payment amount="380"/> are valid XML for the same data.
Your parser must handle both.
→ CDATA sections: text content that may contain special characters
is wrapped in <![CDATA[...]]> — must be unwrapped before use.
→ Large XML files cannot be loaded into memory entirely.
Use streaming parsers (SAX or ElementTree's iterparse in Python) for files too large to parse in memory.
Log files — semi-structured but barely
Application log files are technically semi-structured — each line has a timestamp, a severity, and a message — but the structure is inconsistently applied and the message content is free text. Log files are the most challenging form of semi-structured data because the "schema" is defined by developers writing log statements, not by a formal specification.
Raw application log (what you receive):
2026-03-17 20:14:32.847 INFO [OrderService] Order 9284751 placed by customer 4201938
2026-03-17 20:14:33.012 INFO [PaymentService] Payment captured: ₹380.00 via UPI
2026-03-17 20:14:33.241 DEBUG [RestaurantService] Notifying restaurant 7823
2026-03-17 20:15:02.119 WARN [OrderService] Delivery estimate exceeding threshold: order 9284751
2026-03-17 20:16:41.003 ERROR [PaymentService] Refund failed for order 9284754: timeout after 3s
Parsing this requires:
1. Regex to extract timestamp, level, service, message
2. Further parsing of message field to extract IDs and values
3. Handling log lines that span multiple lines (stack traces)
4. Handling encoding variations (₹ symbol)
5. Dealing with log rotation and multiple log files per hour
Pattern:
^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+
(INFO|DEBUG|WARN|ERROR)\s+
\[([^\]]+)\]\s+(.+)$
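Applying a pattern of this shape with Python's re module might look like the sketch below. Lines that do not match — stack-trace continuation lines, for example — return None rather than crashing the parser:

```python
import re

LOG_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(INFO|DEBUG|WARN|ERROR)\s+"
    r"\[([^\]]+)\]\s+(.+)$"
)

def parse_line(line):
    """Extract timestamp, level, service, message from one log line, or None if it
    does not match (e.g. the continuation lines of a multi-line stack trace)."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ts, level, service, message = m.groups()
    return {"timestamp": ts, "level": level, "service": service, "message": message}

line = ("2026-03-17 20:16:41.003 ERROR [PaymentService] "
        "Refund failed for order 9284754: timeout after 3s")
rec = parse_line(line)
print(rec["level"], rec["service"])  # → ERROR PaymentService
```

Extracting the order ID from `rec["message"]` then needs a second, message-specific regex — which is exactly why structured JSON logging is worth pushing for.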
Structured log (what modern applications emit — much easier):
{"timestamp":"2026-03-17T20:14:32.847Z","level":"INFO",
"service":"OrderService","event":"order_placed",
"order_id":9284751,"customer_id":4201938}
If you have influence over how your company's services log —
always push for structured JSON logging. It eliminates regex.
What semi-structured data demands from your pipeline
Unstructured Data
Unstructured data has no predefined schema and no inherent organisation that a machine can automatically interpret. It is raw content — the bytes of an image, the waveform of an audio file, the free text of a customer review, the pages of a PDF contract. By common industry estimates, roughly 80% of all data generated worldwide is unstructured.
For most of data engineering history, unstructured data was stored but largely ignored — too difficult to process at scale without the ML tools to interpret it. That has changed dramatically in the past three years. Every major Indian tech company now has projects that extract structured signals from unstructured data: sentiment from customer reviews, fraud signals from transaction descriptions, product categories from listing images, insights from call recordings.
Types of unstructured data and where they appear
How data engineers handle unstructured data
A data engineer's relationship with unstructured data has three distinct responsibilities, each requiring different skills and tools.
RESPONSIBILITY 1: Store and Organise
Unstructured data must be stored in object storage (S3, ADLS, GCS)
with a clear organisation scheme so it can be found and processed.
Bad: s3://company-data/uploads/file1.jpg
Good: s3://company-data/raw/product-images/category=electronics/
sku=SKU-00283741/2026-03-17/image_001.jpg
Good organisation includes:
→ Meaningful key prefix (not random UUIDs)
→ Partition by date for time-range access
→ Partition by category/type for selective processing
→ Consistent naming convention
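A small helper that enforces such a convention keeps every writer in the team producing identical keys. A sketch — the bucket layout and prefix names are illustrative, following the "good" example above, and the `category=`/`sku=` segments use Hive-style key=value partitioning so query engines can prune by prefix:

```python
from datetime import date

def build_object_key(category, sku, filename, day=None):
    """Build a partitioned object-store key:
    raw/product-images/category=.../sku=.../YYYY-MM-DD/<filename>"""
    day = day or date.today()  # partition by date for time-range access
    return (f"raw/product-images/category={category}/"
            f"sku={sku}/{day.isoformat()}/{filename}")

key = build_object_key("electronics", "SKU-00283741", "image_001.jpg",
                       day=date(2026, 3, 17))
print(key)
# → raw/product-images/category=electronics/sku=SKU-00283741/2026-03-17/image_001.jpg
```

Centralising the convention in one function means a later change (say, adding a region partition) happens in one place instead of in every upload script.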
RESPONSIBILITY 2: Extract Metadata
Even before ML processing, extract the metadata that IS structured:
For images: file_size, format (JPEG/PNG), dimensions (width×height),
creation_date, source_system, associated_entity_id
For audio: file_size, format (MP3/WAV), duration_seconds, sample_rate,
call_id, agent_id, customer_id, call_timestamp
For documents: file_size, page_count, creation_date, author (if available),
document_type, associated_case_id
This metadata is structured and immediately useful for filtering,
capacity planning, and processing queue management.
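Much of that metadata can be pulled with the standard library alone — no ML involved. A minimal sketch for images: format detection via magic bytes, and PNG dimensions read straight from the IHDR chunk (JPEG dimensions need a full marker scan, omitted here; in practice a library like Pillow does this for every format):

```python
import os
import struct

def image_metadata(path):
    """Extract the structured metadata an image file carries before any ML runs."""
    meta = {"file_size": os.path.getsize(path), "format": None,
            "width": None, "height": None}
    with open(path, "rb") as f:
        header = f.read(26)
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        meta["format"] = "PNG"
        # PNG layout: 8-byte signature, 4-byte chunk length, b"IHDR",
        # then width and height as big-endian uint32 at offsets 16 and 20
        meta["width"], meta["height"] = struct.unpack(">II", header[16:24])
    elif header.startswith(b"\xff\xd8\xff"):
        meta["format"] = "JPEG"  # dimensions require scanning SOF markers
    return meta

# demo: write a minimal PNG header (signature + IHDR prefix + 800x600 dimensions)
with open("demo.png", "wb") as f:
    f.write(b"\x89PNG\r\n\x1a\n" + struct.pack(">I", 13) + b"IHDR"
            + struct.pack(">II", 800, 600))
print(image_metadata("demo.png"))
# → {'file_size': 24, 'format': 'PNG', 'width': 800, 'height': 600}
```

These fields go straight into a structured metadata table, which is what makes the millions of raw files in the object store filterable at all.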
RESPONSIBILITY 3: Build Processing Pipelines
Orchestrate the ML models or transformation tools that convert
unstructured content into structured signals.
Image pipeline:
Raw image → resize/normalise → ML model → structured output
{sku: "SKU-283741", category: "electronics", has_watermark: false,
background: "white", product_visible: true, quality_score: 0.94}
Customer review pipeline:
Raw text → clean/normalise → NLP model → structured output
{review_id: 8734, sentiment: "negative", score: 0.23,
topics: ["delivery", "packaging"], entities: ["Bangalore"],
urgency: "high", requires_response: true}
Call recording pipeline:
Audio file → speech-to-text → transcript → NLP → structured output
{call_id: "CALL-83921", duration_secs: 247, transcript: "...",
sentiment_overall: "frustrated", resolution: "refund_initiated",
agent_empathy_score: 0.78}
The data lakehouse for unstructured data
Unstructured data cannot live in a data warehouse — warehouses store tables, not images or audio files. Object storage (S3, ADLS, GCS) is the only viable storage at scale for unstructured data. The data engineer's job is to build a metadata layer that makes the unstructured data discoverable and processable alongside the structured and semi-structured data.
The practical pattern: store raw unstructured data in object storage, extract metadata into a structured table, and store ML model outputs as structured data in the warehouse. The warehouse contains the derived structured signals; the raw unstructured data lives in the lake. Analysts query the warehouse; ML engineers query the lake directly.
All Three Categories — The Complete Comparison
| Dimension | Structured | Semi-Structured | Unstructured |
|---|---|---|---|
| Schema | Fixed, enforced by external system | Self-describing, flexible — carried within the data | None — no inherent organisation |
| Examples | SQL tables, CSV with headers, Excel, Parquet | JSON, XML, Avro, log files | Images, audio, video, free text, PDFs, emails |
| Primary storage | Relational DB, Data Warehouse | Object store (S3/ADLS) + document store | Object store only (S3/ADLS/GCS) |
| Query method | SQL | SQL + JSON functions | ML models only — cannot SQL-query raw pixels |
| Schema change | Hard — breaks pipelines | Flexible — new fields just appear | N/A — no schema to change |
| Best for | Transactional data, reporting, analytics | Events, APIs, configs, logs, hierarchical data | Content, media, documents, sensor raw streams |
| Transformation | SQL CTEs, dbt models | Flatten, explode, parse, normalise | OCR, speech-to-text, image classification, NLP, embedding |
| Warehouse ready? | Yes — native table storage | Needs flattening first | No — store outputs of ML processing |
| Compression | Moderate (columnar storage 5–10×) | Very high (repetitive keys compress well) | Format-dependent (images: 10–20×, audio: 5–10×) |
| Pipeline failures | Schema drift, type mismatches, NULL propagation | Missing fields, type changes, array explosion | Model drift, storage costs, processing time |
Storage Implications — Which Category Goes Where
The category of data determines which storage system is appropriate. Choosing the wrong storage for a data category does not cause an immediate error — you can technically store anything anywhere. But the wrong choice creates query performance problems, cost inefficiencies, and maintenance nightmares that compound over months and years.
How Indian Companies Handle All Three Types
Every major Indian tech company deals with all three data categories simultaneously. Here is what that looks like in practice at three different types of organisations.
Designing a Pipeline That Handles All Three Types at Once
Your manager gives you a task: "Build a pipeline that ingests all data from our new claim processing workflow and makes it available for analysis. The workflow generates a claim record in PostgreSQL, a JSON event payload from the mobile app, and a scanned PDF of the supporting document."
One business process, three data types. Here is how you approach it.
Step 1 — Classify each source: The PostgreSQL claim record is structured — fixed schema, enforced types, reliable. The JSON event from the app is semi-structured — self-describing but potentially inconsistent across app versions. The scanned PDF is unstructured — binary content requiring OCR before any fields can be extracted.
Step 2 — Design separate ingestion paths for each type: The PostgreSQL record goes through standard incremental SQL extraction into the Bronze layer as Parquet. The JSON event goes through a Kafka consumer that reads the event stream and lands raw JSON files in the Bronze layer. The PDF goes through an S3 PUT event trigger that fires an OCR pipeline when a new document is uploaded.
Step 3 — Design converging transformation: In the Silver layer, all three inputs transform into structured tables. The PostgreSQL data cleans directly into a claims_silver table. The JSON event is flattened — claim_id, device_type, app_version, GPS coordinates, timestamp — into a claim_events_silver table. The OCR output extracts document_type, issue_date, insured_name, and document_verified flag into a claim_documents_silver table.
Step 4 — Join in Gold: A single Gold table joins all three Silver tables on claim_id, giving analysts one row per claim with all structured fields from all three sources. The raw PDF is linked via a URL column pointing to S3 for auditors who need to view the original document.
The lesson: A data engineer who only knows one data type is blocked at step one. A data engineer who understands all three can design the architecture, identify the tools needed for each path, and deliver a unified output that hides the complexity from the analyst who just sees a clean SQL table.
5 Interview Questions — With Complete Answers
Errors You Will Hit — And Exactly Why They Happen
🎯 Key Takeaways
- ✓ Every data source belongs to one of three categories: Structured (fixed schema, enforced), Semi-Structured (self-describing, flexible), or Unstructured (no schema, requires interpretation). Identifying the category immediately tells you which tools, storage systems, and transformation approaches to use.
- ✓ Structured data has a rigid schema enforced externally — relational databases, CSVs with headers. It is easy to query with SQL but breaks pipelines when the schema changes. Always monitor source schema and handle changes explicitly.
- ✓ CSV looks structured but is not enforced — nothing prevents wrong column counts, wrong types, or inconsistent null representations. Always validate CSV structure before processing. Never assume vendor CSVs are well-formed.
- ✓ Semi-structured data is self-describing — JSON, XML, Avro, log files. Its strength is flexibility; its risk is that the structure can change at any time without warning. Use .get() with defaults in Python, COALESCE in SQL, and always validate schema against expectations.
- ✓ JSON arrays must be explicitly exploded (one row per array element) when relational representation is needed. Always check join cardinality before exploding — unexpected fan-out multiplies rows and corrupts aggregations.
- ✓ Unstructured data (images, audio, video, free text, PDFs) has no queryable schema. Data engineers store it in object storage, extract available metadata as structured fields, and build processing pipelines that use ML models to produce structured outputs.
- ✓ The correct storage for each category: structured data for analytics goes to data warehouses; semi-structured and all raw data goes to object storage (S3/ADLS); operational structured data goes to relational databases; unstructured data always goes to object storage.
- ✓ The flattening decision for nested JSON: flatten frequently-queried fields into columns, keep rarely-accessed deep nesting as JSON columns. Flatten too much and schema maintenance becomes burdensome. Flatten too little and analysts cannot query data without JSON functions.
- ✓ Schema change monitoring is mandatory for structured and semi-structured sources. Compare source schema against the last known schema on every pipeline run. Alert when they differ. Never let a schema change silently drop data.
- ✓ Real companies handle all three data types simultaneously. Flipkart has structured order tables, semi-structured product catalogue JSON, and unstructured product images — all flowing through different pipeline paths that converge in the Gold layer. A data engineer must be fluent in all three.