PostgreSQL for Data Engineers
The most important database skill you can learn. From absolute zero — install PostgreSQL, understand tables, write real queries, and build skills that every data engineering job requires.
Install PostgreSQL, create a real database, write SELECT / INSERT / UPDATE / DELETE queries, understand JOINs, and learn the patterns data engineers use every single day.
What is PostgreSQL?
PostgreSQL (often called "Postgres") is a relational database — a system that stores data in organised tables with rows and columns, like a very powerful, programmable spreadsheet.
It has been in development since 1986, it is completely free and open source, and it is trusted by companies like Apple, Instagram, Spotify, and thousands of others to store billions of rows of data.
Open source — no licencing cost, ever. Used by startups and Fortune 500 companies equally.
Handles everything from 1,000 rows to billions of rows. Used in production at massive scale.
If a job says "SQL experience required" — PostgreSQL knowledge covers it completely.
PostgreSQL vs Other Databases
| Database | Type | Free? | When used in DE |
|---|---|---|---|
| PostgreSQL | Relational | ✅ Yes | Source systems, staging, metadata stores |
| MySQL | Relational | ✅ Yes | Web app backends, OLTP source systems |
| SQL Server | Relational | ❌ Paid | Enterprise source systems (common in Azure) |
| BigQuery | Cloud DW | 💳 Pay per query | GCP analytics warehouse |
| Snowflake | Cloud DW | 💳 Pay per use | Multi-cloud analytics warehouse |
| MongoDB | Document | ✅ Free tier | Semi-structured / JSON data |
Step 1 — Download and Install PostgreSQL
Go to postgresql.org/download → choose your operating system → download the installer.
- 1.Run the .exe installer
- 2.Click Next through all steps
- 3.Set a password for the postgres user — remember this!
- 4.Keep default port: 5432
- 5.pgAdmin will be installed automatically
- 1.Download the .dmg installer
- 2.Drag to Applications
- 3.Or use Homebrew: brew install postgresql@16
- 4.Start with: brew services start postgresql@16
- 5.pgAdmin: download separately from pgadmin.org
Step 2 — Open pgAdmin (Your Visual Interface)
pgAdmin is the free visual tool that comes with PostgreSQL. It lets you browse databases, run queries, and see your data without typing everything in a terminal.
Open pgAdmin from your Start Menu / Applications → it opens in your browser → enter the password you set during installation.
Servers
└── PostgreSQL 16
└── Databases
└── postgres ← default database, always exists
└── Schemas → public → Tables
Core Concepts You Must Know
| Term | What It Means | Example |
|---|---|---|
| Database | A container that holds all your tables and data | freshmart_db |
| Table | Like a spreadsheet — rows and columns of data | sales, customers, products |
| Row | One single record in a table | One sale transaction |
| Column | A field/attribute every row has | order_id, amount, date |
| Schema | A folder inside a database to organise tables | raw, silver, gold |
| Query | A question you ask the database in SQL | SELECT * FROM sales |
| Primary Key | A column that uniquely identifies each row | order_id |
| Foreign Key | A column that links to a primary key in another table | store_id → stores.store_id |
| Index | A speed-up structure so queries find rows faster | Index on order_date |
| NULL | A missing or unknown value — not zero, not empty string | A sale with no discount |
Data Types — What Type Does Each Column Store?
Every column in a table must have a data type. This tells PostgreSQL what kind of data to expect and how to store it efficiently.
| Type | Use For | Example |
|---|---|---|
| INTEGER | Whole numbers | 1, 42, 1000 |
| BIGINT | Very large whole numbers | Row counts, IDs in large tables |
| NUMERIC(p,s) | Exact decimals (money) | NUMERIC(10,2) → 29999.99 |
| VARCHAR(n) | Text with max length | VARCHAR(100) for names |
| TEXT | Unlimited length text | Descriptions, notes |
| BOOLEAN | True / False | is_active, is_deleted |
| DATE | Date only | 2024-01-15 |
| TIMESTAMP | Date + Time | 2024-01-15 09:30:00 |
| JSONB | JSON data (binary, fast) | API responses, flexible fields |
Step 3 — Create a Database
In pgAdmin → right-click Databases → Create → Database. Or open the Query Tool and run:
CREATE DATABASE freshmart_db;
Now connect to it — in pgAdmin click on freshmart_db in the left panel, then open the Query Tool (Tools → Query Tool).
Step 4 — Create Your First Table
We will recreate the FreshMart scenario in PostgreSQL. Let us create a stores table and a sales table.
CREATE TABLE stores ( store_id VARCHAR(10) PRIMARY KEY, store_name VARCHAR(100) NOT NULL, city VARCHAR(100) NOT NULL, state VARCHAR(50) NOT NULL, opened_date DATE );
What each part means:
store_id VARCHAR(10) PRIMARY KEY→ Text column, max 10 chars, must be unique — identifies each storestore_name VARCHAR(100) NOT NULL→ Text column, max 100 chars, cannot be emptycity VARCHAR(100) NOT NULL→ City name — requiredopened_date DATE→ Date only — no time. NULL allowed (we did not say NOT NULL)CREATE TABLE sales ( order_id VARCHAR(20) PRIMARY KEY, store_id VARCHAR(10) NOT NULL REFERENCES stores(store_id), product_name VARCHAR(200) NOT NULL, category VARCHAR(100), quantity INTEGER NOT NULL DEFAULT 1, unit_price NUMERIC(10,2) NOT NULL, total_amount NUMERIC(10,2) GENERATED ALWAYS AS (quantity * unit_price) STORED, order_date DATE NOT NULL, created_at TIMESTAMP DEFAULT NOW() );
REFERENCES stores(store_id) is a Foreign Key — it means everystore_id in the sales table MUST exist in the stores table first. PostgreSQL will reject any sale for a store that does not exist. This is called referential integrity.Step 5 — Insert Data
Tables are empty after creation. Let us add some FreshMart data.
INSERT INTO stores (store_id, store_name, city, state, opened_date)
VALUES
('ST001', 'FreshMart New Delhi', 'New Delhi', 'Delhi', '2020-01-15'),
('ST002', 'FreshMart Mumbai', 'Mumbai', 'Maharashtra','2020-03-10'),
('ST003', 'FreshMart Bangalore', 'Bangalore', 'Karnataka', '2020-06-01'),
('ST004', 'FreshMart Chennai', 'Chennai', 'Tamil Nadu','2021-01-20'),
('ST005', 'FreshMart Hyderabad', 'Hyderabad', 'Telangana', '2021-04-05');INSERT INTO sales (order_id, store_id, product_name, category, quantity, unit_price, order_date)
VALUES
('ORD1001', 'ST001', 'Basmati Rice 5kg', 'Grocery', 12, 299.00, '2024-01-15'),
('ORD1002', 'ST001', 'Samsung TV 43inch', 'Electronics', 2, 32000.00,'2024-01-15'),
('ORD1003', 'ST001', 'Amul Butter 500g', 'Dairy', 25, 240.00, '2024-01-15'),
('ORD1004', 'ST002', 'Sunflower Oil 1L', 'Grocery', 18, 145.00, '2024-01-15'),
('ORD1005', 'ST002', 'iPhone 14', 'Electronics', 1, 75000.00,'2024-01-15'),
('ORD1006', 'ST002', 'Amul Milk 1L', 'Dairy', 40, 62.00, '2024-01-15'),
('ORD1007', 'ST003', 'Nike Running Shoes', 'Apparel', 5, 4500.00,'2024-01-16'),
('ORD1008', 'ST003', 'Colgate Toothpaste', 'Personal Care',30, 89.00, '2024-01-16'),
('ORD1009', 'ST004', 'Levis Jeans', 'Apparel', 8, 2999.00,'2024-01-16'),
('ORD1010', 'ST005', 'Dove Soap 100g', 'Personal Care',50, 65.00, '2024-01-16');Step 6 — SELECT Queries (Reading Data)
SELECT is the most important SQL command. It retrieves data from your tables.
-- Get every row and every column SELECT * FROM sales; -- Get only specific columns SELECT order_id, product_name, total_amount FROM sales; -- Filter rows with WHERE SELECT * FROM sales WHERE category = 'Electronics'; -- Multiple conditions SELECT * FROM sales WHERE category = 'Grocery' AND unit_price > 100; -- Sort results SELECT * FROM sales ORDER BY total_amount DESC; -- Limit number of results SELECT * FROM sales ORDER BY order_date DESC LIMIT 5; -- Filter by date SELECT * FROM sales WHERE order_date = '2024-01-15'; -- Filter by date range SELECT * FROM sales WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
-- is a comment in SQL. Anything after -- on a line is ignored by PostgreSQL. Always comment your queries — your future self will thank you.Step 7 — Aggregate Functions (Summarising Data)
Aggregate functions calculate a single value from many rows. These are what data engineers use to build summary reports.
-- Count all rows SELECT COUNT(*) FROM sales; -- Sum of all sales SELECT SUM(total_amount) FROM sales; -- Average order value SELECT ROUND(AVG(total_amount), 2) AS avg_order_value FROM sales; -- Highest and lowest sale SELECT MAX(total_amount) AS highest, MIN(total_amount) AS lowest FROM sales; -- Total sales per category ← this is GROUP BY SELECT category, COUNT(*) AS order_count, SUM(total_amount) AS total_revenue, ROUND(AVG(unit_price),2) AS avg_price FROM sales GROUP BY category ORDER BY total_revenue DESC;
Expected result of the last query:
categoryorder_counttotal_revenueavg_price
Electronics3139000.0046333.33
Apparel226492.003749.50
Dairy25960.00151.00
Step 8 — HAVING (Filter After GROUP BY)
WHERE filters rows before grouping. HAVING filters after grouping. Use HAVING when you want to filter based on an aggregate result.
-- WHERE filters individual rows before grouping SELECT category, SUM(total_amount) AS revenue FROM sales WHERE order_date >= '2024-01-01' -- ← filters rows first GROUP BY category; -- HAVING filters groups after aggregation SELECT category, SUM(total_amount) AS revenue FROM sales GROUP BY category HAVING SUM(total_amount) > 5000; -- ← only show categories with revenue > 5000
Step 9 — JOINs (Combining Tables)
Real data is almost never in a single table. JOINs combine rows from two or more tables based on a related column. This is one of the most important skills in data engineering.
| JOIN Type | Returns | Real World Analogy |
|---|---|---|
| INNER JOIN | Only rows that match in BOTH tables | Customers who placed an order |
| LEFT JOIN | All left rows + matching right rows | All customers, with orders if they have any |
| RIGHT JOIN | All right rows + matching left rows | All orders, with customer details if available |
| FULL JOIN | All rows from both tables, matched where possible | Every customer and every order |
-- INNER JOIN — sales with their store name SELECT s.order_id, s.product_name, s.total_amount, st.store_name, st.city FROM sales s INNER JOIN stores st ON s.store_id = st.store_id; -- Total revenue per city SELECT st.city, COUNT(s.order_id) AS total_orders, SUM(s.total_amount) AS total_revenue FROM stores st LEFT JOIN sales s ON st.store_id = s.store_id GROUP BY st.city ORDER BY total_revenue DESC NULLS LAST;
s and st are aliases — short names for tables used inside a query. FROM sales s means "call the sales table 's' in this query". Aliases make long queries much easier to read.Step 10 — UPDATE and DELETE
-- UPDATE a single row UPDATE stores SET city = 'Bengaluru' WHERE store_id = 'ST003'; -- UPDATE multiple rows UPDATE sales SET category = 'Food & Grocery' WHERE category = 'Grocery'; -- DELETE a specific row DELETE FROM sales WHERE order_id = 'ORD1001'; -- DELETE with a condition DELETE FROM sales WHERE order_date < '2023-01-01';
DELETE FROM saleswithout a WHERE clause deletes every single row in the table instantly — with no undo. Always double-check your WHERE condition before running.Patterns Data Engineers Use Every Day
These are the SQL patterns you will actually write in a real DE job — not just SELECT * FROM table.
1. UPSERT — Insert or Update
In pipelines you often need to insert new records but update existing ones. This is called an upsert.
INSERT INTO stores (store_id, store_name, city, state)
VALUES ('ST001', 'FreshMart New Delhi Updated', 'New Delhi', 'Delhi')
ON CONFLICT (store_id)
DO UPDATE SET
store_name = EXCLUDED.store_name,
city = EXCLUDED.city;
-- If ST001 exists → UPDATE it. If not → INSERT it.2. CTEs — Clean Up Complex Queries
A CTE (Common Table Expression) lets you break a complex query into named steps — like variables in a query.
WITH daily_totals AS (
SELECT
order_date,
SUM(total_amount) AS daily_revenue
FROM sales
GROUP BY order_date
),
high_revenue_days AS (
SELECT * FROM daily_totals
WHERE daily_revenue > 50000
)
SELECT * FROM high_revenue_days ORDER BY daily_revenue DESC;3. Window Functions — Rank and Compare Rows
Window functions calculate a value for each row based on a group of related rows — without collapsing rows like GROUP BY does.
-- Rank products by revenue within each category SELECT product_name, category, total_amount, RANK() OVER (PARTITION BY category ORDER BY total_amount DESC) AS rank_in_category FROM sales; -- Running total of revenue by date SELECT order_date, total_amount, SUM(total_amount) OVER (ORDER BY order_date) AS running_total FROM sales ORDER BY order_date;
4. CREATE TABLE AS — Build Summary Tables
In pipelines you often materialise a query result into a new table — this is the Silver → Gold pattern.
-- Create a Gold-layer summary table CREATE TABLE gold_sales_by_store_category AS SELECT st.store_id, st.city, s.category, COUNT(s.order_id) AS total_orders, SUM(s.total_amount) AS total_revenue, ROUND(AVG(s.unit_price), 2) AS avg_unit_price, MIN(s.order_date) AS first_sale_date, MAX(s.order_date) AS last_sale_date FROM sales s INNER JOIN stores st ON s.store_id = st.store_id GROUP BY st.store_id, st.city, s.category;
Quick Reference — Most Used Commands
🎯 Key Takeaways
- ✓PostgreSQL is a free, open-source relational database — the most important database skill for any data engineer
- ✓Every table must have defined columns with data types — use VARCHAR for text, NUMERIC for money, DATE/TIMESTAMP for time
- ✓SELECT, INSERT, UPDATE, DELETE are the four core operations — always use WHERE with UPDATE and DELETE
- ✓GROUP BY + aggregate functions (SUM, COUNT, AVG) is how you build summaries — the foundation of every report
- ✓JOINs combine data from multiple tables — INNER JOIN for matches only, LEFT JOIN to keep all rows from the left table
- ✓CTEs make complex queries readable by breaking them into named steps — use them in every non-trivial query
- ✓Window functions (RANK, SUM OVER) calculate per-row values across groups without collapsing rows like GROUP BY
- ✓UPSERT (ON CONFLICT DO UPDATE) is the key pattern for idempotent pipeline loads — safe to run multiple times
- ✓CREATE TABLE AS SELECT is how you materialise Gold-layer summary tables — the end result of a pipeline
Discussion
0Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.