What is Data? How Computers Store Information
The foundation of everything — bits, bytes, files, and why data needs engineers.
What Actually Is Data?
Before you build pipelines, before you write Python, before you think about the cloud — you need to understand what you are actually working with. Most tutorials skip this entirely; they assume you already know. By the end of this section, you will know it properly.
Data is a recorded observation. That is the simplest accurate definition. Any time something happens in the world and someone or something records it, that recording is data.
When you tap the Swiggy app and order a biryani, a set of facts gets recorded: what you ordered, when you ordered it, from which restaurant, your delivery address, the price, the payment method, the device you used, your location coordinates. All of those facts together form one order record. Swiggy processes over 3 million orders every single day. Each one creates dozens of data points. That is data.
When a Razorpay payment gateway processes a transaction, it records: the merchant, the amount, the currency, the timestamp, the payment instrument used, the success or failure status, the response time. Razorpay handles over 500 million transactions every year. Every single transaction is data.
When you read this page, your browser is generating data — which page you loaded, how long you spent on it, what device you are using, which country you are in. Even reading is data.
Here is where it gets interesting: a fact that is never recorded is not data. A customer who walked into a store, bought something, and left without any system recording that transaction — that sale never became data. It happened in the world, but the world has no memory of it. This is why data engineering exists: to make sure the right facts get captured, stored correctly, and made available when needed.
Data is meaningless without context
The number 42 is not data. It is just a number. But "₹42 — delivery charge — Swiggy order #8734621 — 14 March 2026 — Mumbai" is data. It is a fact about something specific that happened. Context is what turns numbers and text into information.
This distinction matters deeply when you are building data systems. Raw numbers sitting in a file with no column names, no timestamps, no source identification — that is not useful data. A data engineer's job always involves making sure data carries enough context to be trusted and understood.
Binary — How Computers Actually Think
Every computer on earth — your phone, a Flipkart server in Hyderabad, the satellite orbiting 36,000 kilometres above you — stores and processes everything using the same two values. Zero and one. That is it. The entire digital world is built on two states.
This is not a simplification. It is literally true. The reason computers use binary is physical. A transistor — the fundamental building block of every processor and memory chip — is essentially a tiny switch. It can be off or on. No current flowing, or current flowing. Engineers mapped "off" to 0 and "on" to 1. Every piece of data you have ever seen on a screen started as a pattern of these switches.
Why not use more than two states?
You might wonder: why not use ten states (0 through 9) like we do in everyday counting? It would fit more information into each switch. Researchers have tried. The problem is reliability. With two states, a switch is clearly on or clearly off — there is no ambiguity. With ten states, even a tiny variation in electrical voltage could cause the computer to misread a 4 as a 5. At the billions-of-operations-per-second speed that modern processors run, even rare mistakes would cascade into constant errors. Binary is reliable precisely because it is extreme — fully on, or fully off.
How binary represents numbers
In our everyday decimal system, each position in a number represents a power of 10. The number 347 means: 3 hundreds (10²) + 4 tens (10¹) + 7 ones (10⁰).
Binary works exactly the same way, but with powers of 2 instead of powers of 10. Each position can only hold a 0 or a 1.
Position value: 128 64 32 16 8 4 2 1
(2⁷) (2⁶) (2⁵) (2⁴) (2³) (2²) (2¹) (2⁰)
Binary 00001010 = 0 0 0 0 1 0 1 0
= 0 + 0 + 0 + 0 + 8 + 0 + 2 + 0
= 10 (the number ten, in decimal)
Binary 01000001 = 0 1 0 0 0 0 0 1
= 0 + 64 + 0 + 0 + 0 + 0 + 0 + 1
             = 65 (the number sixty-five, in decimal)
You do not need to memorise binary-to-decimal conversion. What you need to understand is this: every number your computer works with — every price, every user ID, every timestamp — is ultimately a pattern of zeros and ones in memory. The computer converts between binary and the decimal numbers you see on screen automatically.
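You can check any of these conversions yourself. A minimal sketch using only Python built-ins (`int` with base 2, and `format` for binary strings):

```python
# Convert the binary patterns above to decimal, and back again.
assert int("00001010", 2) == 10
assert int("01000001", 2) == 65

# Format a decimal number as an 8-bit binary string.
assert format(10, "08b") == "00001010"
assert format(65, "08b") == "01000001"

# Each position contributes a power of 2, exactly as in the table above.
bits = "00001010"
value = sum(int(bit) * 2**power
            for power, bit in enumerate(reversed(bits)))
assert value == 10
```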
Bits, Bytes and Scale — From 0 to Petabytes
A single zero or one is called a bit — short for binary digit. One bit can represent two possible states. Two bits can represent four states (00, 01, 10, 11). Eight bits together form a byte. One byte can represent 256 different values (2⁸ = 256), which is enough to store any single character in the English alphabet, any number from 0 to 255, or one pixel of a very simple image.
1 bit = one 0 or 1
8 bits = 1 byte = can hold 256 different values
Examples of what fits in 1 byte:
The letter 'A' = 65 in decimal = 01000001 in binary
The letter 'a' = 97 in decimal = 01100001 in binary
The number 200 = 200 in decimal = 11001000 in binary
The number 0 = 0 in decimal = 00000000 in binary
The number 255 = 255 in decimal = 11111111 in binary
The storage scale — and what it means in practice
Now we build up from a single byte to the scale that data engineers actually work with. These are not abstract units — every one of these levels corresponds to a real engineering challenge.
1 Byte (B) = 8 bits
≈ one character of text
1 Kilobyte (KB) = 1,024 bytes
≈ one short text message
≈ half a page of plain text
1 Megabyte (MB) = 1,024 KB = ~1 million bytes
≈ one medium-quality photo
≈ one minute of compressed audio
1 Gigabyte (GB) = 1,024 MB = ~1 billion bytes
≈ one full HD movie (compressed)
≈ 1,000 books as plain text
1 Terabyte (TB) = 1,024 GB = ~1 trillion bytes
≈ 200,000 photos
≈ all books in a large library
1 Petabyte (PB) = 1,024 TB = ~1 quadrillion bytes
≈ Google processes ~20 PB per day
≈ Flipkart's data warehouse: multi-PB scale
Real scale of Indian tech companies
When you join a data engineering team at a mid-size Indian startup, you will typically work with data in the gigabytes to low terabytes range. At a large platform like Zomato, Meesho, or PhonePe, the scale is multi-terabyte to low petabyte. At FAANG India operations, it is petabyte scale.
The reason this matters to you: the scale of data directly determines which tools and approaches you use. A 10 MB CSV file can be opened in Excel. A 10 GB CSV will not open in Excel at all, and even in Python it needs careful, chunked handling to avoid exhausting memory. A 10 TB dataset cannot fit on a single machine at all — you need distributed systems. Understanding scale is how you know which solution is appropriate.
How Different Kinds of Data Are Stored
Everything is ultimately zeros and ones. But how does a computer turn a photo, a song, or a sentence into zeros and ones — and then perfectly reconstruct the original from them? Understanding this removes all the mystery from data formats, which is something you will deal with constantly as a data engineer.
How numbers are stored
Integers (whole numbers) are stored directly in binary. The question is only how many bits to use. More bits means a wider range of values you can represent.
8-bit integer = 1 byte = values from 0 to 255
(or -128 to 127 if negative numbers are needed)
Use case: age, small counters, status codes
16-bit integer = 2 bytes = values from 0 to 65,535
Use case: port numbers, small IDs
32-bit integer = 4 bytes = values from 0 to ~4.3 billion
Use case: most IDs, counts, quantities
Danger zone: Swiggy order IDs exceeded 2B in 2023
64-bit integer = 8 bytes = values from 0 to ~18.4 quintillion
Use case: timestamps (Unix epoch in milliseconds),
            large financial transaction IDs
Decimal numbers (like ₹349.99) are stored as floating point numbers. Floating point is a way of storing a number with a decimal point by recording a mantissa (the significant digits) and an exponent (how far to shift the decimal point). This is how your computer stores ₹349.99 — not as the exact value, but as the closest representable binary fraction.
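You can see the "closest representable binary fraction" problem directly, and the two standard workarounds for monetary values. A short sketch with Python's standard library (the paise amounts are illustrative):

```python
from decimal import Decimal

# 0.1 has no exact binary representation, so float arithmetic drifts:
total = 0.1 + 0.2
assert total != 0.3                 # actually 0.30000000000000004
assert abs(total - 0.3) < 1e-15     # tiny — but fatal in financial reconciliation

# Workaround 1: integers in the smallest currency unit (paise/cents).
price_paise = 34999                 # ₹349.99 stored as 34,999 paise — exact
assert price_paise * 3 == 104997

# Workaround 2: Decimal, which stores base-10 digits exactly.
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```

This is the reasoning behind the rule you will see later: never store money as floating point.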
How text is stored — character encoding
Text is stored by mapping each character to a number, then storing that number in binary. The mapping table is called an encoding. The original encoding — ASCII — mapped 128 characters (English letters, numbers, punctuation) to numbers 0 through 127, using 7 bits per character.
Character → Decimal → Binary
'A' → 65 → 01000001
'B' → 66 → 01000010
'a' → 97 → 01100001
'z' → 122 → 01111010
'0' → 48 → 00110000
'9' → 57 → 00111001
' ' (space) → 32 → 00100000
So the word "Data" stored in ASCII:
D = 68 = 01000100
a = 97 = 01100001
t = 116 = 01110100
a = 97 = 01100001
"Data" takes exactly 4 bytes in ASCII.
ASCII only covers English. The world has thousands of languages. Unicode was created to solve this — it maps over 140,000 characters from every major human language to unique numbers. UTF-8 is the most widely used encoding that implements Unicode. It is backward-compatible with ASCII for English characters, but uses 2 to 4 bytes for characters outside the ASCII range.
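Python's `str.encode` lets you verify all of this. A minimal sketch — ASCII characters cost 1 byte each in UTF-8, while a Devanagari character costs 3:

```python
# ASCII characters encode to 1 byte each in UTF-8 (backward compatible):
encoded = "Data".encode("utf-8")
assert encoded == b"Data"
assert len(encoded) == 4
assert list(encoded) == [68, 97, 116, 97]   # D, a, t, a — as in the table above

# Characters outside ASCII need more bytes.
# The Devanagari letter "न" (U+0928) takes 3 bytes in UTF-8:
assert len("न".encode("utf-8")) == 3

# ASCII simply cannot represent it at all:
assert "न".encode("ascii", errors="ignore") == b""
```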
How images are stored
A digital image is a grid of pixels. Each pixel is a colour. Each colour is stored as three numbers — the intensity of red, green, and blue (RGB) — each between 0 and 255, which fits in one byte. So each pixel takes 3 bytes.
A 1920×1080 HD photo (no compression):
1920 pixels wide × 1080 pixels tall = 2,073,600 pixels
Each pixel = 3 bytes (red, green, blue)
Total = 2,073,600 × 3 = 6,220,800 bytes = ~6.2 MB
A 12 megapixel smartphone photo (no compression):
12,000,000 pixels × 3 bytes = 36,000,000 bytes = ~36 MB
After JPEG compression (typical 10:1 ratio):
~36 MB becomes ~3.6 MB
This is why image compression formats like JPEG exist:
they reduce file size by discarding detail the human eye
barely notices. The file is smaller; some data is lost.
This is your first introduction to a concept you will use constantly as a data engineer: the trade-off between storage size and data fidelity. Compression reduces size but changes the data. Some compression is lossless (the original can be perfectly reconstructed). Some is lossy (some original data is permanently discarded). Choosing the right approach depends entirely on what the data is used for.
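The raw-size arithmetic above is worth being able to reproduce on demand — it is how you estimate storage costs before building anything. A minimal sketch (the 10:1 JPEG ratio is the typical figure quoted above, not a guarantee):

```python
def raw_rgb_size_bytes(width: int, height: int) -> int:
    """Uncompressed size of an RGB image: one byte each for R, G, B per pixel."""
    return width * height * 3

# The HD example from above:
hd = raw_rgb_size_bytes(1920, 1080)
assert hd == 6_220_800                  # ~6.2 MB uncompressed

# The 12 MP smartphone example, before and after a typical 10:1 JPEG ratio:
raw_12mp = 12_000_000 * 3
assert raw_12mp == 36_000_000           # ~36 MB raw
assert raw_12mp // 10 == 3_600_000      # ~3.6 MB after compression
```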
RAM vs Disk — Why Both Exist and Why It Matters
Your computer has two fundamentally different ways of storing data. Most beginners treat them as the same thing. They are not. The difference between them explains why databases are designed the way they are, why some queries are fast and others slow, and why memory management is one of the hardest parts of building data pipelines.
The speed gap is enormous — and it shapes everything
In access latency, RAM is roughly 100,000 times faster than a traditional hard disk, and hundreds to about a thousand times faster than an NVMe SSD. This gap explains almost every performance decision in data engineering.
Storage Type Speed Typical Size Cost per GB
─────────────────────────────────────────────────────────────
CPU Cache ~1 ns 4–64 MB Built into CPU
RAM ~100 ns 8–256 GB ~$5–8/GB
NVMe SSD ~100 μs 256 GB–4 TB ~$0.10/GB
SATA SSD ~500 μs 256 GB–8 TB ~$0.06/GB
HDD ~10 ms 1–20 TB ~$0.02/GB
Network Storage ~1 ms–100 ms Unlimited (cloud) ~$0.02–0.05/GB
ns = nanosecond (0.000000001 seconds)
μs = microsecond (0.000001 seconds)
ms = millisecond (0.001 seconds)
When you run a Python script that reads a 50 GB CSV file, Python must load it from disk into RAM before it can process anything. If your machine only has 16 GB of RAM, it cannot hold 50 GB at once. It has to read, process, and discard data in chunks — or crash. This is why data engineers write code that processes data in batches, use generators instead of loading everything at once, and choose storage formats (like Parquet) that allow reading only the columns you need.
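The batch-and-generator pattern is worth seeing concretely. A minimal sketch using only the standard library — it demonstrates itself on a small throwaway file (the file name and column names are invented for the demo):

```python
import csv
import os
import tempfile

def iter_csv_batches(path, batch_rows=100_000):
    """Yield (header, rows) batches so memory use stays bounded,
    no matter how large the file on disk is."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) >= batch_rows:
                yield header, batch
                batch = []
        if batch:
            yield header, batch   # the final, possibly short, batch

# Demonstrate on a small throwaway file: 250 rows, read in batches of 100.
path = os.path.join(tempfile.gettempdir(), "orders_demo.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "amount_paise"])
    writer.writerows([i, i * 100] for i in range(250))

sizes = [len(rows) for _, rows in iter_csv_batches(path, batch_rows=100)]
assert sizes == [100, 100, 50]    # never more than 100 rows in memory at once
os.remove(path)
```

The same generator works unchanged whether the file is 2 KB or 200 GB — only the number of batches grows.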
When you see a query run slowly against a database, the most common reason is that the data needed to answer the query was not in RAM — the database had to read it from disk, which takes orders of magnitude longer. Database indexes exist precisely to minimise how much data must be read from disk.
What a File Actually Is
You have been working with files your entire life — documents, photos, songs, PDFs. But almost nobody learns what a file actually is at the level that matters for engineering. Once you know, a huge number of things will suddenly make sense.
A file is bytes plus metadata
At the lowest level, a file is just a sequence of bytes stored on disk. There is no magic. A text file is bytes that happen to be valid UTF-8 encoded characters. An image file is bytes that represent pixel colours. A CSV file is bytes that happen to follow the convention of comma-separated values.
What makes a file more than just a raw sequence of bytes is its metadata — data about the data. The operating system maintains a file system that tracks:
File metadata stored by the operating system:
name → orders_2026_03_14.csv
location → which sectors on disk hold the bytes
size → 4,827,392 bytes
created_at → 2026-03-14 06:00:01 UTC
modified_at → 2026-03-14 06:00:47 UTC
permissions → who can read, write, or execute it
type → the file extension is just a convention,
not enforced by the OS
The actual file content:
Just bytes. The operating system does not care what they
mean. That is the application's job to interpret.
The extension on a file — .csv, .json, .parquet — is just a naming convention. It is a hint to applications about how to interpret the bytes. It is not enforced. You can rename a CSV file to have a .txt extension and the bytes inside do not change. It is still comma-separated data. This is why reading a corrupted or mis-named file can cause confusing errors — the application expects one byte pattern, finds another.
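You can prove the "extension is only a convention" point in a few lines. Many binary formats start with fixed "magic bytes" — PNG's well-known 8-byte signature is used here — so inspecting the first bytes tells you more than the name does. A sketch (the file name is invented):

```python
import os
import tempfile

# PNG files begin with these exact 8 bytes; the name is irrelevant.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def looks_like_png(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(8) == PNG_MAGIC

# Write comma-separated text into a file with a misleading .png extension:
path = os.path.join(tempfile.gettempdir(), "orders.png")
with open(path, "wb") as f:
    f.write(b"order_id,amount\n1,34999\n")

is_png = looks_like_png(path)
os.remove(path)
assert is_png is False    # the bytes say "CSV", whatever the name claims
```

Validating bytes before trusting a file name is exactly the defensive habit pipelines need.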
File formats are agreements about byte structure
A file format is a specification that says: "if you arrange bytes in this specific pattern, any application that knows this format can read it correctly." The CSV format says: rows are separated by newlines, values within a row are separated by commas, and the first row is optionally a header. The JSON format says: data is structured as key-value pairs in a specific syntax with braces, brackets, colons, and quotes.
As a data engineer, you will work with dozens of file formats. You will read corrupt files, handle encoding mismatches, deal with files that claim to be one format but contain another, and write code that validates file structure before processing. Understanding that a file is ultimately just bytes following a convention is what lets you debug these problems instead of just being confused by them.
Why Files Are Not Enough — The Case for Databases
If data is just bytes in files, why do databases exist? Why not just use files for everything? This is a legitimate question. The answer explains not just what databases are, but why almost every serious application on earth uses one.
Problem 1 — Finding data in a file requires reading all of it
Imagine Flipkart stores all its customer data in one giant CSV file with 500 million rows. You want to find one specific customer by their email address. The only way is to start at the first row and read every single row until you find the match — or reach the end and confirm they do not exist. This is called a full scan. On a file with 500 million rows, this takes minutes. A database solves this with indexes — data structures that let you jump directly to the row you need in milliseconds.
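The scan-versus-index difference can be felt even in plain Python: a list forces you to read every row, while a dict behaves like a database index and jumps straight to the record. A minimal analogy sketch (the emails are fabricated):

```python
# 100,000 fake customer records.
customers = [{"email": f"user{i}@example.com", "id": i} for i in range(100_000)]

def full_scan(email):
    """Read every row until a match — how a query runs with no index."""
    for row in customers:
        if row["email"] == email:
            return row
    return None

# Building a dict is like building an index: pay once, look up cheaply forever.
index = {row["email"]: row for row in customers}

target = "user99999@example.com"        # worst case: the very last row
assert full_scan(target)["id"] == 99999  # touched all 100,000 rows
assert index[target]["id"] == 99999      # one hash lookup, no scan
```

A real database index (usually a B-tree rather than a hash table) works on the same principle at disk scale.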
Problem 2 — Concurrent access breaks files
What happens when two processes try to write to the same file at the same time? Without careful coordination, one write overwrites the other, or both get interleaved in a way that produces garbage data. Imagine two Razorpay servers simultaneously recording payments to the same file. With no coordination, transactions disappear. Databases are built to handle thousands of simultaneous reads and writes safely.
Problem 3 — Files have no concept of transactions
A bank transfer involves two operations: subtract money from account A, add it to account B. If the system crashes after the subtraction but before the addition, the money has vanished. Files have no mechanism to say "either both of these operations happen, or neither does." Databases have transactions — they guarantee that a group of operations either all succeed together, or all fail together, leaving the data in a consistent state.
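You can watch a transaction protect the bank-transfer scenario using SQLite, which ships in Python's standard library. A sketch — the crash is simulated with an exception between the two updates, and the connection's context manager rolls everything back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('A', 1000), ('B', 0)")
conn.commit()

try:
    with conn:  # one transaction: both updates commit, or neither does
        conn.execute("UPDATE accounts SET balance = balance - 500 WHERE name = 'A'")
        raise RuntimeError("simulated crash between the two operations")
        conn.execute("UPDATE accounts SET balance = balance + 500 WHERE name = 'B'")
except RuntimeError:
    pass  # the context manager rolled the transaction back

# No money vanished: A still has 1000, B still has 0.
balance_a = conn.execute("SELECT balance FROM accounts WHERE name='A'").fetchone()[0]
assert balance_a == 1000
```

With a plain file, the subtraction would already be on disk and the money would simply be gone.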
Problem 4 — Files do not enforce data structure
Nothing stops someone from adding a row to a CSV with the wrong number of columns, or putting text where a number should be, or leaving required fields empty. Databases enforce schemas — rules that define exactly what each column can contain. If you try to insert a row that violates those rules, the database rejects it. This catches bad data before it corrupts your entire dataset.
Files Databases
──────────────────────────────────────────────────────────────────
Sequential access (read all Random access (jump to any
to find one thing) record instantly via indexes)
No concurrency control Built-in concurrent access
(two writers corrupt data) (thousands of writers safely)
No transactions ACID transactions
(crash = data corruption) (crash = automatic recovery)
No schema enforcement Schema enforcement
(garbage in = garbage stored) (invalid data rejected)
No query language SQL / query language
(write code for every search) (one query for any search)
Best for: Best for:
bulk storage operational data
archiving applications
  data transfer between systems     anything with concurrent access
This is not to say files are bad. Files are essential. As a data engineer you will work with both extensively. Files — especially efficient formats like Parquet — are the backbone of data lakes and long-term storage. Databases handle the live, transactional data your applications run on. Understanding when to use each is one of the first architectural decisions you will face on the job.
The Scale Problem — Why Data Needs Engineers
You now understand what data is, how it is stored, and why databases exist. But you still have not answered the most important question: why is there a job called "data engineer"? Why not just let the application write data to a database and have analysts query it directly?
The answer is scale, velocity, variety, and conflict.
Scale — the database that runs the app cannot also serve analytics
A transactional database — the one that Swiggy's app writes orders to in real time — is optimised for fast individual reads and writes. It is terrible at the kind of questions analytics needs: "What is the total order value broken down by city and restaurant category for the last 30 days?" Running that query on the live application database would require scanning millions of rows, using enormous amounts of CPU, and slowing down the live app for every user placing an order at that moment. Companies cannot risk that.
The solution is to copy data from the operational database into a separate system built specifically for analytics queries — a data warehouse or data lake. Someone has to build and maintain the pipelines that perform that copy, continuously, reliably, and correctly. That person is the data engineer.
Velocity — data is generated faster than humans can manage it
Zomato generates GPS pings from delivery partners every few seconds. During peak hours, that is potentially millions of events per minute. A human cannot manually process these. They need automated pipelines that capture the stream, aggregate it, and make it available for analysis — all in near real-time. Data engineers design and build those automated systems.
Variety — data comes from dozens of sources in different formats
A typical company has data in: a MySQL production database, a MongoDB collection for product catalogue, Kafka event streams from user actions, CSV files from partner vendors, JSON responses from third-party APIs, Excel files from the finance team, and log files from application servers. All of these need to be brought together, made consistent, and stored in a way that allows unified analysis. Each source requires a different connector, a different parsing approach, and a different validation strategy. That work is data engineering.
Conflict — raw data is almost never usable as-is
Raw data from real systems is messy. Customer names have inconsistent capitalisation. Dates are in three different formats depending on which team created the field. The same product has different IDs in the CRM and the order management system. Null values mean different things in different tables. A data engineer's job includes cleaning, standardising, and validating data before it reaches analysts and scientists — because decisions made on bad data are worse than no data at all.
Data is generated at high speed from many different sources in many different formats. It needs to be moved, cleaned, combined, and stored in a way that allows it to be queried reliably and quickly — without disrupting the systems that generated it in the first place. This pipeline has to run automatically, handle failures gracefully, scale as data volume grows, and produce output that analysts and scientists can trust. Building, running, and improving that pipeline is the job of a data engineer.
Day One at a Bangalore Startup — The Data Problem You Inherit
You join a Series B fintech startup as their first dedicated data engineer. The company has 800,000 active users, processes ₹50 crore in transactions per month, and has a team of four analysts who are all using Excel.
On your third day, your manager sends you a Slack message: "Our analysts need to answer: what is our 30-day retention rate by acquisition channel, broken down by city, for the last 6 months? The product team needs this by Friday."
What you find when you investigate
User data lives in a PostgreSQL database — the same one the app reads and writes to in real time. It has 23 tables with no documentation. Transaction data is in a separate MySQL database managed by a vendor. Acquisition channel data is in a Google Sheet that the marketing team manually updates every Monday. City data is derived from IP addresses at signup, stored as raw IP strings, not city names.
Nobody has connected these systems before. There is no data warehouse. There is no pipeline. The analysts have been manually exporting CSVs from the databases every week and joining them in Excel — and the Excel files are 200 MB and crash regularly.
What this problem is, at its core
This is a data engineering problem. The raw data exists. It is recorded. It is stored in databases. But it is in three different systems, in different formats, with no automated way to bring it together. Someone needs to build the pipeline that: extracts data from all three sources, transforms it into a consistent structure, resolves the IP-to-city mapping, loads it into a single queryable destination, and keeps it updated automatically so the analysts do not have to do any of this manually.
That is your job. And before you can do any of it well, you need to understand exactly what you are dealing with at every level — what the data is, how it is stored, what format it is in, and what happens to it as it moves from source to destination.
This is why Module 01 starts here. Not with tools. Not with the cloud. With the foundation. Because the data engineer who understands data deeply writes pipelines that do not break at 3am.
Five Misconceptions That Hurt Data Engineers
These are the wrong mental models that cause real bugs, bad architectural decisions, and wasted hours. Clear them out now before they become habits.
5 Interview Questions — With Complete Answers
Errors You Will Hit — And Exactly Why They Happen
These are real errors that appear when working with data at the byte and encoding level. Every data engineer hits them. Now you will understand them when you see them.
🎯 Key Takeaways
- ✓Data is a recorded observation — a fact about the world captured by a system. If it is not recorded, it is not data.
- ✓Every computer stores everything as binary — patterns of 0s and 1s. Understanding this explains every storage decision you will ever make.
- ✓A byte is 8 bits and can hold 256 values. Choosing the right data type (int32 vs int64, float vs decimal) directly affects storage cost and correctness.
- ✓Never store monetary values as floating point. Use integers (paise/cents) or DECIMAL types. Floating point arithmetic accumulates errors that cause financial reconciliation failures.
- ✓RAM is fast but volatile and expensive. Disk is slow but persistent and cheap. Everything in data engineering architecture is shaped by managing this trade-off.
- ✓A file is just bytes — the extension is only a convention. File formats are agreements about byte structure. A mis-named or corrupt file causes a data pipeline to fail in confusing ways.
- ✓Text encoding determines how characters map to bytes. Always use UTF-8. Always declare the encoding explicitly. Never assume.
- ✓Databases exist because files cannot handle concurrent access, do not support transactions, have no indexing, and enforce no schema. Both files and databases have important roles in data engineering.
- ✓Operational databases (OLTP) and analytical databases (OLAP) are built for different workloads. Running analytics on a production database slows the application and produces slow query results. This is why data warehouses and data pipelines exist.
- ✓The data engineer exists because data is generated too fast, from too many sources, in too many formats, for any manual process to handle. The job is to build the automated systems that make raw data reliably usable.