Data Formats — CSV, JSON, Parquet, Avro, ORC
How each format works internally, when to use it, and what breaks when you pick the wrong one.
The Format Decision That Changed Everything
Around 2012, a team at Twitter needed to process billions of events per day using row-oriented text formats such as CSV. Queries that should have taken minutes took hours. Storage costs were enormous. Together with Cloudera, they built a new columnar format, Parquet, released in 2013. Query time dropped by roughly 90%. Storage cost dropped by roughly 75%. They published the results. The industry never went back.
The file format you choose determines four things: how much disk space your data takes, how fast queries run against it, whether your pipeline can handle schema changes, and whether different tools can read it. Choosing the wrong format does not cause an error — it causes slow pipelines, high cloud bills, and architectural pain that compounds over years.
Every data engineer needs to know five formats fluently: CSV, JSON, Parquet, Avro, and ORC. Not because you will use all five in every project — you will not — but because you need to know which one is right for each situation and why.
Same dataset: 4 orders
ORDER DATA:
order_id | customer_id | city | amount | status
─────────────────────────────────────────────────────
9284751 | 4201938 | Bangalore | 380.00 | delivered
9284752 | 1092847 | Mumbai | 220.00 | cancelled
9284753 | 8374621 | Hyderabad | 540.00 | confirmed
9284754 | 2938471 | Bangalore | 180.00 | delivered
ROW FORMAT (CSV / Avro):
Disk block 1: [9284751, 4201938, Bangalore, 380.00, delivered]
Disk block 2: [9284752, 1092847, Mumbai, 220.00, cancelled]
Disk block 3: [9284753, 8374621, Hyderabad, 540.00, confirmed]
Disk block 4: [9284754, 2938471, Bangalore, 180.00, delivered]
Query: SELECT SUM(amount) FROM orders
Must read: ALL 5 columns × 4 rows = 20 values read, 4 needed
COLUMNAR FORMAT (Parquet / ORC):
Column "order_id": [9284751, 9284752, 9284753, 9284754]
Column "customer_id": [4201938, 1092847, 8374621, 2938471]
Column "city": [Bangalore, Mumbai, Hyderabad, Bangalore]
Column "amount": [380.00, 220.00, 540.00, 180.00]
Column "status": [delivered, cancelled, confirmed, delivered]
Query: SELECT SUM(amount) FROM orders
Reads: ONLY the "amount" column = 4 values read, 4 needed
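The diagram above can be made concrete with a toy in-memory model. This is plain Python, not a real file format, just a sketch of how many values each layout forces the query to touch:

```python
# Toy illustration of the I/O difference between row and columnar layouts.
rows = [
    (9284751, 4201938, "Bangalore", 380.00, "delivered"),
    (9284752, 1092847, "Mumbai", 220.00, "cancelled"),
    (9284753, 8374621, "Hyderabad", 540.00, "confirmed"),
    (9284754, 2938471, "Bangalore", 180.00, "delivered"),
]

# Row layout: SUM(amount) must touch every value of every row.
values_touched_row = sum(len(r) for r in rows)  # 20 values

# Columnar layout: the same query touches only the "amount" column.
names = ["order_id", "customer_id", "city", "amount", "status"]
columns = {name: [r[i] for r in rows] for i, name in enumerate(names)}
values_touched_col = len(columns["amount"])  # 4 values

total = sum(columns["amount"])  # 1320.0
```

At four rows the difference is trivial; at four billion rows, touching one fifth of the values is the difference between seconds and hours.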
10–100× less I/O for analytical queries at scale
Comma-Separated Values
CSV is the oldest, simplest, and most universal data format. Every system can read and write it. A CSV file is plain text — open it in Notepad and you see exactly what is in it. No binary encoding, no special software needed, no schema file required. This simplicity is why CSV has survived for 50 years and will survive 50 more.
How CSV works internally
A CSV file is a sequence of lines. Each line is one record. Within each line, values are separated by a delimiter — usually a comma, sometimes a tab (TSV), semicolon, or pipe. The first line is optionally a header row naming the columns. That is the entire specification.
orders.csv — opened in a text editor:
order_id,customer_id,city,amount,status
9284751,4201938,Bangalore,380.00,delivered
9284752,1092847,Mumbai,220.00,cancelled
9284753,8374621,Hyderabad,540.00,confirmed
What the file actually is on disk:
order_id,customer_id,city,amount,status
9284751,4201938,Bangalore,380.00,delivered
9284752,1092847,Mumbai,220.00,cancelled
= newline character (line ending)
Every value is a string — there are no types
380.00 is not a number — it is the characters '3','8','0','.','0','0'
The application reading the file decides what type to interpret it as
Quoting rule (RFC 4180):
If a value contains the delimiter, wrap it in double quotes:
9284755,4201938,"Mumbai, Maharashtra",380.00,delivered
────────────────────
quoted because it contains a comma
If a value contains double quotes, escape them by doubling:
9284756,4201938,"Hotel ""Grand"" Mumbai",380.00,delivered
What CSV does well and where it breaks
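One thing CSV does well when handled by a real parser: the quoting rules above are applied automatically, so you should never build CSV lines by hand with string concatenation. A minimal sketch with Python's built-in csv module:

```python
import csv
import io

# Writing a value that contains a comma: csv.writer quotes it automatically
# (RFC 4180 minimal quoting is the default dialect).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([9284755, 4201938, "Mumbai, Maharashtra", "380.00", "delivered"])
line = buf.getvalue().strip()
# line == '9284755,4201938,"Mumbai, Maharashtra",380.00,delivered'

# Reading it back recovers the original fields, but note that every
# value comes back as a string: CSV carries no types.
fields = next(csv.reader(io.StringIO(line)))
# fields[0] == "9284755"  (a string, not an int)
```

Where CSV breaks is exactly here: the reader must guess that "9284755" is an integer and "380.00" is a decimal, and different tools guess differently.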
When to use CSV
CSV is the right choice for data exchange between systems where simplicity and universal compatibility matter more than performance. Vendor data exports, finance team reports, government open data, and one-time data loads are all appropriate CSV use cases. Never use CSV as the storage format inside a data lake or warehouse — convert to Parquet at the Bronze layer. Never use CSV for large analytical datasets — query performance is poor and storage cost is high compared to columnar alternatives.
JavaScript Object Notation
JSON is the lingua franca of the internet. Every REST API speaks JSON. Every modern web application exchanges JSON. Every NoSQL database stores JSON. As a data engineer, you will read, parse, validate, flatten, and transform more JSON than any other format. Module 07 covered JSON structure in depth — here we focus on JSON as a storage and transfer format.
JSON as a file format — NDJSON and JSON Lines
A single JSON object is fine for one record. But how do you store a million records? A single JSON array containing a million objects works but has a critical problem: you must read the entire file before you can parse any record, because the array is not complete until the closing bracket. This makes streaming and chunked processing impossible.
The solution used in data engineering is NDJSON (Newline Delimited JSON) or JSON Lines — one complete JSON object per line, with a newline between records. This allows streaming parsers to read one record at a time without loading the entire file.
WRONG for large datasets — single JSON array:
[
{"order_id": 9284751, "amount": 380.00, "city": "Bangalore"},
{"order_id": 9284752, "amount": 220.00, "city": "Mumbai"},
... 999,998 more records ...
{"order_id": 10284750, "amount": 540.00, "city": "Hyderabad"}
]
Problem: entire file must be read before any record can be parsed.
1 GB file = hold 1 GB in memory to read record 1.
CORRECT for large datasets — NDJSON / JSON Lines:
{"order_id": 9284751, "amount": 380.00, "city": "Bangalore"}
{"order_id": 9284752, "amount": 220.00, "city": "Mumbai"}
{"order_id": 9284753, "amount": 540.00, "city": "Hyderabad"}
Each line is a complete, valid JSON object.
Parsers read line by line — constant memory regardless of file size.
Any line can be skipped or processed independently.
Spark, Pandas, and all major DE tools support NDJSON natively.
Python reading NDJSON efficiently:
import json

with open("orders.ndjson") as f:
    for line in f:                 # reads one line at a time
        record = json.loads(line)  # parses one object at a time
        process(record)            # your handler — constant memory usage
JSON storage cost — the key overhead
JSON stores every key name with every record. In a CSV, column names appear once in the header. In JSON, the key "order_id" is repeated for every single record. For a dataset with 20 fields and 100 million records, this key repetition adds hundreds of megabytes of overhead that carries zero information.
10 million order records stored as JSON:
Each record:
{"order_id":9284751,"customer_id":4201938,"city":"Bangalore",
"amount":380.00,"status":"delivered","created_at":"2026-03-17"}
Key name overhead per record:
"order_id" = 10 chars
"customer_id" = 13 chars
"city" = 6 chars
"amount" = 8 chars
"status" = 8 chars
"created_at" = 12 chars
Total key overhead = 57 chars per record × 10M records
= 570 MB of key names carrying zero data
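The 57-character figure can be checked in a couple of lines (this counts only the quoted key names, as the illustration above does, ignoring colons and commas):

```python
# Verifying the key-name overhead arithmetic for the record shape above.
keys = ["order_id", "customer_id", "city", "amount", "status", "created_at"]

# Each key costs len(key) + 2 quote characters in the serialised JSON.
overhead_per_record = sum(len(k) + 2 for k in keys)  # 57 characters

records = 10_000_000
total_mb = overhead_per_record * records / 1_000_000  # 570.0 MB of key names
```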
Compared formats for same 10M orders:
JSON (uncompressed): ~4.2 GB
JSON (gzip): ~0.9 GB
CSV (uncompressed): ~1.8 GB
CSV (gzip): ~0.4 GB
Parquet: ~0.3 GB ← columnar + compression combined
JSON is fine for transfer. It is a poor choice for storage at scale.
When to use JSON
JSON is the right format for API communication, event streaming payloads, configuration files, and the landing zone where raw API responses are preserved. Convert JSON to Parquet at the Bronze layer for all analytical storage. Never keep large datasets in JSON format in a data lake long-term — the storage cost and query performance are both significantly worse than Parquet.
Apache Parquet
Parquet is the workhorse format of modern data engineering. If you work with data lakes — Bronze, Silver, Gold layers — you work with Parquet. It is the default storage format for Spark, dbt, Databricks, and every major cloud data lake. Understanding Parquet's internals explains why it is so much faster and cheaper than CSV or JSON for analytical workloads.
Parquet internals — how it achieves its performance
Parquet files have a specific internal structure that enables two performance optimisations working together: columnar storage and column-level compression and statistics. Understanding both explains the performance numbers that seem almost impossible.
A Parquet file is divided into Row Groups.
Each Row Group contains Column Chunks.
Each Column Chunk is stored consecutively on disk.
FILE STRUCTURE:
┌─────────────────────────────────────────────────┐
│ Magic bytes: PAR1 (marks file as Parquet) │
├─────────────────────────────────────────────────┤
│ ROW GROUP 1 (e.g., rows 1 – 100,000) │
│ ├── Column Chunk: order_id │
│ │ [9284751, 9284752, 9284753 ...] │
│ │ Encoding: DELTA_BINARY_PACKED │
│ │ Compression: SNAPPY │
│ │ Min: 9284751 Max: 9384750 Nulls: 0 │
│ ├── Column Chunk: amount │
│ │ [380.00, 220.00, 540.00 ...] │
│ │ Encoding: PLAIN │
│ │ Compression: SNAPPY │
│ │ Min: 50.00 Max: 4999.00 Nulls: 0 │
│ ├── Column Chunk: city │
│ │ [Bangalore, Mumbai, Bangalore ...] │
│ │ Encoding: RLE_DICTIONARY │
│ │ Compression: SNAPPY │
│ │ Min: Bangalore Max: Pune Nulls: 0 │
│ └── Column Chunk: status │
│ [delivered, cancelled, delivered ...] │
│ Encoding: RLE_DICTIONARY (few values) │
├─────────────────────────────────────────────────┤
│ ROW GROUP 2 (rows 100,001 – 200,000) │
│ ...same structure... │
├─────────────────────────────────────────────────┤
│ FILE FOOTER (metadata for the entire file) │
│ Schema definition (column names + types) │
│ Row group statistics (min/max per column) │
│ Row group offsets (where each group starts) │
│ Magic bytes: PAR1 │
└─────────────────────────────────────────────────┘
The footer contains everything a query engine needs to decide
which row groups to read — without opening them.
The two mechanisms that make Parquet fast
Mechanism 1 — Predicate pushdown using statistics. The footer of a Parquet file stores the minimum and maximum value for every column in every row group. A query engine reads the footer first — before reading any data — and uses this information to skip row groups that cannot possibly contain matching rows.
Query: SELECT SUM(amount) FROM orders WHERE city = 'Bangalore'
Parquet file has 10 row groups (1M rows each = 10M total rows)
Footer statistics for "city" column per row group:
Row Group 1: Min=Bangalore, Max=Pune → MAY contain Bangalore ✓ read
Row Group 2: Min=Chennai, Max=Delhi → CANNOT contain Bangalore ✗ skip
Row Group 3: Min=Ahmedabad, Max=Bangalore → MAY contain Bangalore ✓ read
Row Group 4: Min=Delhi, Max=Hyderabad → CANNOT contain Bangalore ✗ skip
...
Result: 6 of 10 row groups are skipped entirely.
Only 4 million rows are read, not 10 million.
60% of the data is skipped before columnar storage even factors in.
Combined with columnar storage (reading only the amount and city columns):
Traditional CSV: Read 10M rows × 5 columns = 50M values
Parquet: Read 4M rows × 2 columns = 8M values
Effective speedup: ~2.5× from predicate pushdown × ~2.5× from columnar
= ~6× less data read than CSV for this query
Mechanism 2 — Encoding and compression per column. Because all values in a column chunk have the same type and often similar values, Parquet applies specialised encodings before compression. A column of order statuses containing mostly "delivered" with occasional "cancelled" gets encoded with Run-Length Encoding (RLE) — instead of storing "delivered" 800,000 times, it stores "delivered × 800000". Then Snappy compression reduces it further. The result: a column that takes 8 MB as plain text takes 180 KB as Parquet.
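Run-length encoding is simple enough to sketch in a few lines of plain Python (an illustration of the idea only; real Parquet combines RLE with dictionary encoding and bit packing):

```python
# Toy run-length encoder for a low-cardinality status column.
def rle_encode(values):
    runs, prev, count = [], None, 0
    for v in values:
        if v == prev:
            count += 1          # extend the current run
        else:
            if prev is not None:
                runs.append((prev, count))  # close the finished run
            prev, count = v, 1
    if prev is not None:
        runs.append((prev, count))
    return runs

statuses = ["delivered"] * 847 + ["cancelled"] * 12 + ["delivered"] * 141
encoded = rle_encode(statuses)
# 1,000 values collapse to 3 (value, count) pairs
```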
| Encoding | What it does | Best for |
|---|---|---|
| PLAIN | Stores values as-is, with no encoding transformation | Small columns; float columns whose values vary widely |
| DICTIONARY | Builds a dictionary of unique values and stores index numbers instead of repeating strings: "Bangalore"→0, "Mumbai"→1, so the column becomes [0,1,0,0,1,0] | String columns with few unique values (city, status, category, country code) |
| RLE (Run-Length Encoding) | Stores consecutive repeating values as (value, count) pairs: [delivered×847, cancelled×12] | Boolean columns, sorted columns, status fields |
| DELTA_BINARY_PACKED | Stores differences between consecutive values: [9284751, +1, +1, +1, +1] instead of full values | Auto-increment IDs, timestamps |
After encoding, SNAPPY or GZIP compression is applied.
Snappy: very fast compress/decompress, moderate ratio (~2-3×)
GZIP: slower, better ratio (~4-5×) — use for archival storage
Parquet schema evolution
Parquet supports adding new columns without rewriting existing files. Old files simply have NULL values for the new column when read by new readers. This makes Parquet compatible with incremental schema changes — an important property for long-running data lake tables where adding a column without reprocessing all historical data is essential.
When to use Parquet
Parquet is the default choice for all analytical storage in a data lake — Bronze, Silver, and Gold layers. Use it for any dataset that will be queried with SQL, processed with Spark, or stored in S3/ADLS for analytical purposes. The only reasons not to use Parquet are: you need human-readable files (use CSV), you need to stream individual records through a message broker (use Avro), or you are working in a Hive/MapReduce ecosystem that prefers ORC.
Apache Avro
Avro is a row-oriented binary serialisation format that came out of the Hadoop project, built around compact encoding and formal schema evolution; it has since become the standard serialisation format of the Kafka ecosystem. While Parquet dominates analytical storage, Avro dominates event streaming and data serialisation between services. Understanding the difference in their design goals explains why both exist and why neither can replace the other.
How Avro works — schema separation
Avro separates the schema from the data. The schema — written in JSON — defines the structure of every record: field names, types, whether fields are nullable, and default values for missing fields. The data is stored in a compact binary format that does not repeat field names. To read an Avro file, you need the schema. This schema separation is both Avro's greatest strength and its primary source of operational complexity.
Avro Schema (written in JSON, stored separately):
{
"type": "record",
"name": "Order",
"namespace": "com.freshmart.data",
"fields": [
{"name": "order_id", "type": "long"},
{"name": "customer_id", "type": "long"},
{"name": "city", "type": "string"},
{"name": "amount", "type": {"type": "bytes", "logicalType": "decimal",
"precision": 10, "scale": 2}},
{"name": "status", "type": "string"},
{"name": "promo_code", "type": ["null", "string"], "default": null}
]
}
Avro binary data (what the file actually stores):
No field names. Just values in schema-defined order:
[9284751][4201938][9:Bangalore][380.00][9:delivered][null]
[9284752][1092847][6:Mumbai] [220.00][9:cancelled][null]
Field names are looked up from the schema, not stored with data.
This makes Avro files smaller than JSON (no key repetition)
but requires schema availability to read.
Avro file format:
┌──────────────────────────────────────────────┐
│ File header: magic bytes + schema (JSON) │ ← schema embedded in file
├──────────────────────────────────────────────┤
│ Data block 1 │
│ sync marker (for splittability) │
│ encoded records (binary, compressed) │
├──────────────────────────────────────────────┤
│ Data block 2 ... │
└──────────────────────────────────────────────┘
Unlike Parquet: schema is in the file header, not the footer.
This means you can start streaming from the beginning immediately.
Schema evolution — Avro's killer feature
Avro's most important feature for data engineering is its formal support for schema evolution. When a producer changes the schema of the events it sends (adds a new field, removes a field, changes a default value), Avro's schema compatibility rules determine whether existing consumers can still read the new events without breaking.
SCHEMA COMPATIBILITY RULES:
BACKWARD COMPATIBLE (new schema can read old data):
✓ Add a new field with a default value
Old: {order_id, amount, status}
New: {order_id, amount, status, delivery_time_mins: default=null}
Old consumers reading new data: see null for delivery_time_mins ✓
FORWARD COMPATIBLE (old schema can read new data):
✓ Remove a field that had a default value
Old: {order_id, amount, status, promo_code: default=null}
New: {order_id, amount, status}
Old consumers reading new data: use default for promo_code ✓
BREAKING CHANGES (never do these without coordination):
✗ Rename an existing field
✗ Change a field type (string to int)
✗ Add a field WITHOUT a default value
✗ Remove a field WITHOUT a default value
WHY THIS MATTERS:
In a Kafka pipeline, 50 microservices may consume from one topic.
If a producer changes its Avro schema in a non-backward-compatible
way, all 50 consumers break simultaneously.
Schema Registry (Confluent or AWS Glue Schema Registry) enforces
compatibility rules automatically — rejects schema changes that
would break consumers. Every production Kafka + Avro deployment
must use a Schema Registry.
Avro and Kafka — why they are paired
Avro and Kafka are the standard pairing for event streaming because Avro provides exactly what Kafka needs: compact binary serialisation, schema validation, and formal schema evolution semantics. A Kafka topic carries millions of events per second — the overhead of JSON key repetition at that volume is significant. Avro's compact binary format, combined with a Schema Registry that stores schemas centrally, makes Kafka pipelines both efficient and schema-safe.
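The resolution rule behind these guarantees, reader-side defaults filling in fields the writer did not send, can be sketched in plain Python (this shows the idea only, not the real avro library API or binary encoding):

```python
# Toy sketch of Avro-style schema resolution.
reader_schema = {
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "amount", "type": "double"},
        {"name": "promo_code", "type": ["null", "string"], "default": None},
    ]
}

def resolve(record, schema):
    # Fields present in the data are taken as-is; missing fields fall
    # back to the reader schema's default; no default means incompatible.
    out = {}
    for field in schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value and no default for {field['name']}")
    return out

# The producer stopped sending promo_code (it had a default),
# so an old consumer still reads the new data without breaking:
resolved = resolve({"order_id": 9284751, "amount": 380.0}, reader_schema)
```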
When to use Avro
Avro is the right choice for Kafka event payloads, data serialisation between services, and any pipeline where schema evolution is a primary concern. Avro is not the right choice for analytical storage — convert Avro events to Parquet when landing them in a data lake Bronze layer. Avro is for data in motion; Parquet is for data at rest.
Optimised Row Columnar
ORC is Apache Hive's columnar format — created at the same time as Parquet and with very similar goals. ORC is the default format in the Hive ecosystem and is heavily used in Amazon EMR, Apache Hive, and legacy Hadoop-based data platforms. If you join a company with an older big data stack built on Hive, you will work with ORC.
ORC vs Parquet — the practical differences
ORC and Parquet are genuinely similar — both are columnar, both use statistics-based predicate pushdown, both support compression and encoding, and both support schema evolution. The differences are in ecosystem support and a few technical details.
| Dimension | Parquet | ORC |
|---|---|---|
| Origin | Twitter + Cloudera (2013) | Hortonworks + Facebook for Hive (2013) |
| Default in | Spark, Databricks, dbt, Athena, BigQuery, Snowflake | Apache Hive, Amazon EMR (Hive mode), Presto (Hive tables) |
| Compression | Snappy (default), GZIP, ZSTD, LZO | ZLIB (default), Snappy, LZO — ORC ZLIB often gives better ratios than Parquet Snappy |
| Predicate pushdown | Row group level (min/max statistics) | Stripe level + row index (more granular, sometimes faster for highly selective filters) |
| ACID transactions | Via Delta Lake or Iceberg (not native) | Native in Hive with ORC — INSERT, UPDATE, DELETE supported in Hive tables |
| Bloom filters | Supported (opt-in) | Supported and widely used |
| Nested types | Excellent — designed for deeply nested schemas | Good — handles nesting but Parquet is more flexible |
| Best ecosystem | Cloud-native, Spark-first, modern lakehouse | Hive, legacy Hadoop, Amazon EMR Hive tables |
When to use ORC
ORC is the right choice when working with Hive tables, Apache Hive on EMR, or any environment where ORC is the established standard. For new projects on AWS, Azure, or GCP with Spark or Databricks, Parquet is the default choice. The performance difference between ORC and Parquet for most analytical workloads is small enough that ecosystem compatibility matters more than raw performance.
All Five Formats — The Complete Decision Table
Use this table to make the format decision quickly for any new pipeline or storage need.
| Property | CSV | JSON | Parquet | Avro | ORC |
|---|---|---|---|---|---|
| Storage layout | Row | Row | Columnar | Row | Columnar |
| Human readable | ✓ Yes | ✓ Yes | ✗ Binary | ✗ Binary | ✗ Binary |
| Has schema | Informal (header) | Self-describing | ✓ Embedded in file | ✓ Embedded + Registry | ✓ Embedded in file |
| Schema evolution | None | Flexible (no enforcement) | Add columns safely | Formal rules + Registry | Add columns safely |
| Supports nesting | ✗ No | ✓ Yes (native) | ✓ Yes | ✓ Yes | ✓ Yes |
| Compression | Low (gzip external) | Low (gzip external) | Very high (column-level) | Medium (block-level) | Very high (column-level) |
| Read speed (analytics) | Slow | Slow | Very fast | Slow | Very fast |
| Write speed (streaming) | Fast | Fast | Slow (must buffer) | Very fast | Slow (must buffer) |
| Predicate pushdown | ✗ No | ✗ No | ✓ Row group stats | ✗ No | ✓ Stripe + row index |
| Splittable (Spark) | Only with codec | NDJSON only | ✓ Row groups | ✓ Data blocks | ✓ Stripes |
| Best use case | Data exchange, exports | APIs, events, configs | Data lake storage, analytics | Kafka streaming, CDC | Hive tables, legacy Hadoop |
| Typical size vs CSV | 1× | 1.5–2× | 0.1–0.3× | 0.5–0.7× | 0.1–0.3× |
Real Storage Numbers — What Each Format Actually Costs
These are representative compression ratios for a 100 million row orders dataset, typical of what you would see in production. The differences are not marginal — they directly translate to cloud storage bills.
Dataset: 100M orders, 12 columns
order_id, customer_id, restaurant_id, city, category,
amount, quantity, status, payment_method, created_at,
delivered_at, promo_code (50% null)
FORMAT UNCOMPRESSED WITH COMPRESSION NOTES
────────────────────────────────────────────────────────────────────────
CSV ~38 GB ~9 GB (gzip) Baseline
JSON (NDJSON) ~72 GB ~14 GB (gzip) Key names repeated
Avro ~22 GB ~11 GB (snappy) Compact binary,
no key repetition
Parquet N/A (always ~4.2 GB (snappy) Columnar + per-column
compressed) ~2.8 GB (zstd) encoding + compression
ORC N/A ~3.9 GB (zlib) Similar to Parquet,
~3.1 GB (snappy) slightly different
compression tradeoffs
COST AT AWS S3 STANDARD PRICING (~$0.023/GB/month):
CSV (gzip): 9.0 GB × $0.023 = $0.21/month
JSON (gzip): 14.0 GB × $0.023 = $0.32/month
Avro (snappy): 11.0 GB × $0.023 = $0.25/month
Parquet (snappy): 4.2 GB × $0.023 = $0.097/month
Parquet (zstd): 2.8 GB × $0.023 = $0.064/month
At 1 BILLION rows (10× scale) — annual storage cost:
CSV (gzip): ~$2.10/month → ~$25/year
Parquet (snappy): ~$0.97/month → ~$12/year
At 10 BILLION rows (100× scale) — annual storage cost:
CSV (gzip): ~$250/year
Parquet (snappy): ~$116/year
Storage is only part of the bill: engines like Athena also charge per TB
scanned, so the query-cost gap between CSV and Parquet is usually far
larger than the storage gap.
Format choice is a cost decision as much as a performance decision.
The format decision for each layer in Medallion Architecture
| Layer | Recommended Format | Why |
|---|---|---|
| Landing Zone | Original format (CSV/JSON/Avro as-is) | Preserve exactly what arrived from the source. Do not transform at this stage — just store. |
| Bronze | Parquet (partitioned by date) | Convert from landing format to Parquet for efficient storage and query. Add metadata. Partition by date for fast time-range scans. |
| Silver | Parquet or Delta Lake (Parquet + transaction log) | Cleaned and structured data. Delta Lake adds ACID transactions, time travel, and safe UPDATE/DELETE for GDPR compliance. |
| Gold | Parquet in data lake or native warehouse tables | Aggregated metrics. Load to warehouse (Snowflake/BigQuery) for fast SQL. Keep Parquet copy in lake for backup and Spark access. |
| Kafka Topics (streaming) | Avro + Schema Registry | Event streaming requires compact binary with schema evolution guarantees. Avro is the industry standard for Kafka payloads. |
Migrating a CSV Data Lake to Parquet — A Real Scenario
You join a 3-year-old e-commerce company as their second data engineer. The data lake on S3 contains 3 years of data — all stored as gzip-compressed CSV files. Total size: 4.2 TB. Monthly S3 cost: $97. Monthly Athena query cost (pay-per-TB-scanned): $840. The analytics team runs 200+ queries per day and complains that dashboards refresh in 8–12 minutes.
Your first investigation: you run the same query against a CSV file and a Parquet version you create from a sample. The query scans 2.4 GB of CSV in 47 seconds. The same query against Parquet scans 180 MB and finishes in 3.2 seconds. 15× faster, 13× less data scanned.
The migration plan you propose: Convert all historical CSV files to Parquet, partitioned by date. Run both formats in parallel for one month while validating row counts and checksums match. Switch all queries to Parquet. Decommission CSV after 90 days.
The results after migration: S3 storage drops from 4.2 TB to 680 GB — a 6.2× reduction. Monthly S3 cost drops from $97 to $16. Monthly Athena cost drops from $840 to $62. Dashboard refresh time drops from 8–12 minutes to 45–90 seconds. Total annual saving: approximately $10,500 in cloud costs. Migration effort: two weeks.
This is not an unusual outcome. Parquet migrations at companies still using CSV in their data lake typically produce 5–10× storage reduction and 10–20× query performance improvement. Format choice is one of the highest-leverage decisions a data engineer makes.
🎯 Key Takeaways
- ✓The single most important format decision is row-oriented vs columnar. CSV, JSON, and Avro are row-oriented — fast for writing individual records. Parquet and ORC are columnar — fast for reading specific columns across millions of rows. Analytical queries need columnar. Streaming needs row-oriented.
- ✓Parquet achieves its performance through two mechanisms working together: columnar storage (read only the columns a query needs) and predicate pushdown using row group statistics (skip entire chunks of rows that cannot match the filter). Combined, these can reduce I/O by 95%+ versus CSV for typical analytical queries.
- ✓CSV is for data exchange between systems, not for data lake storage. Convert CSV to Parquet at the Bronze layer. Keeping CSV in a data lake for analytical use is expensive in storage, slow in query performance, and produces high cloud query costs.
- ✓JSON is for APIs, events, and configs — not for data lake storage at scale. The key repetition overhead makes JSON 3–5× larger than equivalent Parquet. Convert JSON to Parquet when landing in the Bronze layer. Store as NDJSON (one record per line) in the landing zone to enable streaming reads.
- ✓Avro is the right choice for Kafka event streaming because it writes individual records immediately without buffering, has a compact binary format, and has formal schema evolution semantics enforced by a Schema Registry. Never use Parquet for Kafka payloads — it requires buffering thousands of records before writing.
- ✓ORC and Parquet have very similar performance. Choose Parquet for new cloud-native projects — it has broader ecosystem support (Spark, Databricks, dbt, Athena, BigQuery, Snowflake). Use ORC when working in Hive or legacy Hadoop environments where ORC is already the standard.
- ✓Parquet schema evolution works by adding columns — old files without the new column return NULL when read by new readers. This is correct and expected behaviour. Design your transformations to handle NULLs from old files using COALESCE with appropriate defaults.
- ✓Avro schema evolution requires a Schema Registry in Kafka pipelines. Backward-compatible changes (adding fields with defaults) are safe. Breaking changes (removing required fields, renaming, changing types) are rejected by the Registry. This enforcement is the Registry's primary value.
- ✓Partitioning is as important as format choice for query performance. A Parquet file without date partitioning requires scanning all years of data for a last-7-days query. With date partitioning (files stored at date=2026-03-17/ paths), only the relevant date directories are scanned. Always partition by the most common filter column.
- ✓Format choice is a cost decision. At 100 million rows, CSV costs roughly 3× more in S3 storage than Parquet, and pay-per-scan query engines such as Athena widen the gap further. At data-lake scale, the combined storage and query-scan savings from migrating CSV to Parquet can exceed $10,000 per year. Such migrations typically produce 5–10× storage reduction and pay back the migration effort in weeks.