Python · SQL · Web Dev · Java · AI/ML tracks launching soon — your one platform for all of IT
Intermediate+150 XP

PostgreSQL for Data Engineers

The most important database skill you can learn. From absolute zero — install PostgreSQL, understand tables, write real queries, and build skills that every data engineering job requires.

90–120 min March 2026
Level: Absolute Beginner
Requires: Nothing — start from zero
Time: 90–120 minutes
Used in: Almost every DE job
What you will learn

Install PostgreSQL, create a real database, write SELECT / INSERT / UPDATE / DELETE queries, understand JOINs, and learn the patterns data engineers use every single day.

What is PostgreSQL?

PostgreSQL (often called "Postgres") is a relational database — a system that stores data in organised tables with rows and columns, like a very powerful, programmable spreadsheet.

It has been in development since 1986, it is completely free and open source, and it is trusted by companies like Apple, Instagram, Spotify, and thousands of others to store billions of rows of data.

🆓
Free Forever

Open source — no licencing cost, ever. Used by startups and Fortune 500 companies equally.

💪
Extremely Powerful

Handles everything from 1,000 rows to billions of rows. Used in production at massive scale.

🏢
Industry Standard

If a job says "SQL experience required" — PostgreSQL knowledge covers it completely.

PostgreSQL vs Other Databases

DatabaseTypeFree?When used in DE
PostgreSQLRelational✅ YesSource systems, staging, metadata stores
MySQLRelational✅ YesWeb app backends, OLTP source systems
SQL ServerRelational❌ PaidEnterprise source systems (common in Azure)
BigQueryCloud DW💳 Pay per queryGCP analytics warehouse
SnowflakeCloud DW💳 Pay per useMulti-cloud analytics warehouse
MongoDBDocument✅ Free tierSemi-structured / JSON data
💡 Note
As a data engineer you will constantly pull data from PostgreSQL (source systems) and sometimes use it as a staging or metadata store. Understanding it deeply is non-negotiable.
PHASE 1 — INSTALL POSTGRESQL

Step 1 — Download and Install PostgreSQL

Go to postgresql.org/download → choose your operating system → download the installer.

🪟 Windows
  1. 1.Run the .exe installer
  2. 2.Click Next through all steps
  3. 3.Set a password for the postgres user — remember this!
  4. 4.Keep default port: 5432
  5. 5.pgAdmin will be installed automatically
🍎 Mac
  1. 1.Download the .dmg installer
  2. 2.Drag to Applications
  3. 3.Or use Homebrew: brew install postgresql@16
  4. 4.Start with: brew services start postgresql@16
  5. 5.pgAdmin: download separately from pgadmin.org
⚠️ Important
Remember the password you set for the postgres user during installation. You will need it every time you connect. If you forget it, reinstalling is the easiest fix.

Step 2 — Open pgAdmin (Your Visual Interface)

pgAdmin is the free visual tool that comes with PostgreSQL. It lets you browse databases, run queries, and see your data without typing everything in a terminal.

Open pgAdmin from your Start Menu / Applications → it opens in your browser → enter the password you set during installation.

pgAdmin left panel structure

Servers

└── PostgreSQL 16

└── Databases

└── postgres ← default database, always exists

└── Schemas → public → Tables

PHASE 2 — UNDERSTAND THE BASICS

Core Concepts You Must Know

TermWhat It MeansExample
DatabaseA container that holds all your tables and datafreshmart_db
TableLike a spreadsheet — rows and columns of datasales, customers, products
RowOne single record in a tableOne sale transaction
ColumnA field/attribute every row hasorder_id, amount, date
SchemaA folder inside a database to organise tablesraw, silver, gold
QueryA question you ask the database in SQLSELECT * FROM sales
Primary KeyA column that uniquely identifies each roworder_id
Foreign KeyA column that links to a primary key in another tablestore_id → stores.store_id
IndexA speed-up structure so queries find rows fasterIndex on order_date
NULLA missing or unknown value — not zero, not empty stringA sale with no discount

Data Types — What Type Does Each Column Store?

Every column in a table must have a data type. This tells PostgreSQL what kind of data to expect and how to store it efficiently.

TypeUse ForExample
INTEGERWhole numbers1, 42, 1000
BIGINTVery large whole numbersRow counts, IDs in large tables
NUMERIC(p,s)Exact decimals (money)NUMERIC(10,2) → 29999.99
VARCHAR(n)Text with max lengthVARCHAR(100) for names
TEXTUnlimited length textDescriptions, notes
BOOLEANTrue / Falseis_active, is_deleted
DATEDate only2024-01-15
TIMESTAMPDate + Time2024-01-15 09:30:00
JSONBJSON data (binary, fast)API responses, flexible fields
PHASE 3 — CREATE YOUR FIRST DATABASE

Step 3 — Create a Database

In pgAdmin → right-click DatabasesCreateDatabase. Or open the Query Tool and run:

SQL — Create Database
CREATE DATABASE freshmart_db;

Now connect to it — in pgAdmin click on freshmart_db in the left panel, then open the Query Tool (Tools → Query Tool).

Step 4 — Create Your First Table

We will recreate the FreshMart scenario in PostgreSQL. Let us create a stores table and a sales table.

SQL — Create stores table
CREATE TABLE stores (
  store_id    VARCHAR(10)  PRIMARY KEY,
  store_name  VARCHAR(100) NOT NULL,
  city        VARCHAR(100) NOT NULL,
  state       VARCHAR(50)  NOT NULL,
  opened_date DATE
);

What each part means:

store_id VARCHAR(10) PRIMARY KEYText column, max 10 chars, must be unique — identifies each store
store_name VARCHAR(100) NOT NULLText column, max 100 chars, cannot be empty
city VARCHAR(100) NOT NULLCity name — required
opened_date DATEDate only — no time. NULL allowed (we did not say NOT NULL)
SQL — Create sales table
CREATE TABLE sales (
  order_id      VARCHAR(20)    PRIMARY KEY,
  store_id      VARCHAR(10)    NOT NULL REFERENCES stores(store_id),
  product_name  VARCHAR(200)   NOT NULL,
  category      VARCHAR(100),
  quantity      INTEGER        NOT NULL DEFAULT 1,
  unit_price    NUMERIC(10,2)  NOT NULL,
  total_amount  NUMERIC(10,2)  GENERATED ALWAYS AS (quantity * unit_price) STORED,
  order_date    DATE           NOT NULL,
  created_at    TIMESTAMP      DEFAULT NOW()
);
🎯 Pro Tip
REFERENCES stores(store_id) is a Foreign Key — it means everystore_id in the sales table MUST exist in the stores table first. PostgreSQL will reject any sale for a store that does not exist. This is called referential integrity.

Step 5 — Insert Data

Tables are empty after creation. Let us add some FreshMart data.

SQL — Insert stores
INSERT INTO stores (store_id, store_name, city, state, opened_date)
VALUES
  ('ST001', 'FreshMart New Delhi',   'New Delhi',  'Delhi',     '2020-01-15'),
  ('ST002', 'FreshMart Mumbai',      'Mumbai',     'Maharashtra','2020-03-10'),
  ('ST003', 'FreshMart Bangalore',   'Bangalore',  'Karnataka', '2020-06-01'),
  ('ST004', 'FreshMart Chennai',     'Chennai',    'Tamil Nadu','2021-01-20'),
  ('ST005', 'FreshMart Hyderabad',   'Hyderabad',  'Telangana', '2021-04-05');
SQL — Insert sales
INSERT INTO sales (order_id, store_id, product_name, category, quantity, unit_price, order_date)
VALUES
  ('ORD1001', 'ST001', 'Basmati Rice 5kg',    'Grocery',      12,  299.00, '2024-01-15'),
  ('ORD1002', 'ST001', 'Samsung TV 43inch',   'Electronics',   2, 32000.00,'2024-01-15'),
  ('ORD1003', 'ST001', 'Amul Butter 500g',    'Dairy',        25,  240.00, '2024-01-15'),
  ('ORD1004', 'ST002', 'Sunflower Oil 1L',    'Grocery',      18,  145.00, '2024-01-15'),
  ('ORD1005', 'ST002', 'iPhone 14',           'Electronics',   1, 75000.00,'2024-01-15'),
  ('ORD1006', 'ST002', 'Amul Milk 1L',        'Dairy',        40,   62.00, '2024-01-15'),
  ('ORD1007', 'ST003', 'Nike Running Shoes',  'Apparel',       5,  4500.00,'2024-01-16'),
  ('ORD1008', 'ST003', 'Colgate Toothpaste',  'Personal Care',30,   89.00, '2024-01-16'),
  ('ORD1009', 'ST004', 'Levis Jeans',         'Apparel',       8,  2999.00,'2024-01-16'),
  ('ORD1010', 'ST005', 'Dove Soap 100g',      'Personal Care',50,   65.00, '2024-01-16');
PHASE 4 — QUERY YOUR DATA

Step 6 — SELECT Queries (Reading Data)

SELECT is the most important SQL command. It retrieves data from your tables.

Basic SELECT patterns
-- Get every row and every column
SELECT * FROM sales;

-- Get only specific columns
SELECT order_id, product_name, total_amount FROM sales;

-- Filter rows with WHERE
SELECT * FROM sales WHERE category = 'Electronics';

-- Multiple conditions
SELECT * FROM sales WHERE category = 'Grocery' AND unit_price > 100;

-- Sort results
SELECT * FROM sales ORDER BY total_amount DESC;

-- Limit number of results
SELECT * FROM sales ORDER BY order_date DESC LIMIT 5;

-- Filter by date
SELECT * FROM sales WHERE order_date = '2024-01-15';

-- Filter by date range
SELECT * FROM sales WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
🎯 Pro Tip
-- is a comment in SQL. Anything after -- on a line is ignored by PostgreSQL. Always comment your queries — your future self will thank you.

Step 7 — Aggregate Functions (Summarising Data)

Aggregate functions calculate a single value from many rows. These are what data engineers use to build summary reports.

Aggregate functions
-- Count all rows
SELECT COUNT(*) FROM sales;

-- Sum of all sales
SELECT SUM(total_amount) FROM sales;

-- Average order value
SELECT ROUND(AVG(total_amount), 2) AS avg_order_value FROM sales;

-- Highest and lowest sale
SELECT MAX(total_amount) AS highest, MIN(total_amount) AS lowest FROM sales;

-- Total sales per category  ← this is GROUP BY
SELECT
  category,
  COUNT(*)              AS order_count,
  SUM(total_amount)     AS total_revenue,
  ROUND(AVG(unit_price),2) AS avg_price
FROM sales
GROUP BY category
ORDER BY total_revenue DESC;

Expected result of the last query:

categoryorder_counttotal_revenueavg_price

Electronics3139000.0046333.33

Apparel226492.003749.50

Dairy25960.00151.00

Step 8 — HAVING (Filter After GROUP BY)

WHERE filters rows before grouping. HAVING filters after grouping. Use HAVING when you want to filter based on an aggregate result.

WHERE vs HAVING
-- WHERE filters individual rows before grouping
SELECT category, SUM(total_amount) AS revenue
FROM sales
WHERE order_date >= '2024-01-01'    -- ← filters rows first
GROUP BY category;

-- HAVING filters groups after aggregation
SELECT category, SUM(total_amount) AS revenue
FROM sales
GROUP BY category
HAVING SUM(total_amount) > 5000;    -- ← only show categories with revenue > 5000

Step 9 — JOINs (Combining Tables)

Real data is almost never in a single table. JOINs combine rows from two or more tables based on a related column. This is one of the most important skills in data engineering.

JOIN TypeReturnsReal World Analogy
INNER JOINOnly rows that match in BOTH tablesCustomers who placed an order
LEFT JOINAll left rows + matching right rowsAll customers, with orders if they have any
RIGHT JOINAll right rows + matching left rowsAll orders, with customer details if available
FULL JOINAll rows from both tables, matched where possibleEvery customer and every order
JOIN examples
-- INNER JOIN — sales with their store name
SELECT
  s.order_id,
  s.product_name,
  s.total_amount,
  st.store_name,
  st.city
FROM sales s
INNER JOIN stores st ON s.store_id = st.store_id;

-- Total revenue per city
SELECT
  st.city,
  COUNT(s.order_id)       AS total_orders,
  SUM(s.total_amount)     AS total_revenue
FROM stores st
LEFT JOIN sales s ON st.store_id = s.store_id
GROUP BY st.city
ORDER BY total_revenue DESC NULLS LAST;
💡 Note
s and st are aliases — short names for tables used inside a query. FROM sales s means "call the sales table 's' in this query". Aliases make long queries much easier to read.

Step 10 — UPDATE and DELETE

UPDATE and DELETE
-- UPDATE a single row
UPDATE stores
SET city = 'Bengaluru'
WHERE store_id = 'ST003';

-- UPDATE multiple rows
UPDATE sales
SET category = 'Food & Grocery'
WHERE category = 'Grocery';

-- DELETE a specific row
DELETE FROM sales WHERE order_id = 'ORD1001';

-- DELETE with a condition
DELETE FROM sales WHERE order_date < '2023-01-01';
⚠️ Important
Always use WHERE with UPDATE and DELETE. Running DELETE FROM saleswithout a WHERE clause deletes every single row in the table instantly — with no undo. Always double-check your WHERE condition before running.
PHASE 5 — DATA ENGINEERING PATTERNS

Patterns Data Engineers Use Every Day

These are the SQL patterns you will actually write in a real DE job — not just SELECT * FROM table.

1. UPSERT — Insert or Update

In pipelines you often need to insert new records but update existing ones. This is called an upsert.

UPSERT with ON CONFLICT
INSERT INTO stores (store_id, store_name, city, state)
VALUES ('ST001', 'FreshMart New Delhi Updated', 'New Delhi', 'Delhi')
ON CONFLICT (store_id)
DO UPDATE SET
  store_name = EXCLUDED.store_name,
  city       = EXCLUDED.city;
-- If ST001 exists → UPDATE it. If not → INSERT it.

2. CTEs — Clean Up Complex Queries

A CTE (Common Table Expression) lets you break a complex query into named steps — like variables in a query.

CTE example
WITH daily_totals AS (
  SELECT
    order_date,
    SUM(total_amount) AS daily_revenue
  FROM sales
  GROUP BY order_date
),
high_revenue_days AS (
  SELECT * FROM daily_totals
  WHERE daily_revenue > 50000
)
SELECT * FROM high_revenue_days ORDER BY daily_revenue DESC;

3. Window Functions — Rank and Compare Rows

Window functions calculate a value for each row based on a group of related rows — without collapsing rows like GROUP BY does.

Window function examples
-- Rank products by revenue within each category
SELECT
  product_name,
  category,
  total_amount,
  RANK() OVER (PARTITION BY category ORDER BY total_amount DESC) AS rank_in_category
FROM sales;

-- Running total of revenue by date
SELECT
  order_date,
  total_amount,
  SUM(total_amount) OVER (ORDER BY order_date) AS running_total
FROM sales
ORDER BY order_date;

4. CREATE TABLE AS — Build Summary Tables

In pipelines you often materialise a query result into a new table — this is the Silver → Gold pattern.

Create summary table
-- Create a Gold-layer summary table
CREATE TABLE gold_sales_by_store_category AS
SELECT
  st.store_id,
  st.city,
  s.category,
  COUNT(s.order_id)           AS total_orders,
  SUM(s.total_amount)         AS total_revenue,
  ROUND(AVG(s.unit_price), 2) AS avg_unit_price,
  MIN(s.order_date)           AS first_sale_date,
  MAX(s.order_date)           AS last_sale_date
FROM sales s
INNER JOIN stores st ON s.store_id = st.store_id
GROUP BY st.store_id, st.city, s.category;
📌 Real World Example
This is exactly what the Silver → Gold layer looks like in a real Medallion Architecture — you read from cleaned data (Silver) and write a business-ready aggregated table (Gold) that analysts and dashboards query directly.

Quick Reference — Most Used Commands

DDL — Structure
CREATE TABLE
ALTER TABLE
DROP TABLE
TRUNCATE TABLE
DML — Data
SELECT
INSERT INTO
UPDATE ... SET
DELETE FROM
Aggregates
COUNT()
SUM()
AVG()
MAX() / MIN()
Clauses
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT

🎯 Key Takeaways

  • PostgreSQL is a free, open-source relational database — the most important database skill for any data engineer
  • Every table must have defined columns with data types — use VARCHAR for text, NUMERIC for money, DATE/TIMESTAMP for time
  • SELECT, INSERT, UPDATE, DELETE are the four core operations — always use WHERE with UPDATE and DELETE
  • GROUP BY + aggregate functions (SUM, COUNT, AVG) is how you build summaries — the foundation of every report
  • JOINs combine data from multiple tables — INNER JOIN for matches only, LEFT JOIN to keep all rows from the left table
  • CTEs make complex queries readable by breaking them into named steps — use them in every non-trivial query
  • Window functions (RANK, SUM OVER) calculate per-row values across groups without collapsing rows like GROUP BY
  • UPSERT (ON CONFLICT DO UPDATE) is the key pattern for idempotent pipeline loads — safe to run multiple times
  • CREATE TABLE AS SELECT is how you materialise Gold-layer summary tables — the end result of a pipeline
Share

Discussion

0

Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.

Continue with GitHub
Loading...