Schemas, Tables, Keys and Indexes — The Building Blocks
The foundation of every database — what each concept is and why it matters.
Why Building Blocks Matter More Than Advanced Techniques
A data engineer who does not deeply understand schemas, keys, and constraints builds pipelines that produce incorrect data — silently, without errors. A missing foreign key lets orphaned records accumulate. A missing NOT NULL constraint lets NULLs propagate into calculations that produce wrong totals. A wrong data type, FLOAT where DECIMAL was needed, silently loses precision in financial data.
These are not theoretical problems. They are among the most common root causes of data quality incidents at real companies. The investigation almost always traces back to a schema design decision made early in the pipeline's life that seemed fine at the time.
This module builds a complete, precise mental model of every database building block: what it is, how it works internally, what happens when it is missing, and what the best practice is. By the end, you will read a database schema the way a senior engineer reads it — seeing not just what is there, but understanding the implications of every design choice.
What a Schema Is — And Why It Is More Than Just a Container
The word "schema" means different things in different contexts, which creates confusion. In the context of a relational database, a schema is a namespace — a named collection of database objects (tables, views, functions, sequences) that belong together. In the context of a table, schema means the definition of the table's structure — its columns, types, and constraints. In the context of a data lake, schema refers to the structure of files as interpreted at read time.
We will cover all three meanings because you will encounter all three in a data engineering career. For this module, we focus primarily on the first two — the database-level schema and the table-level schema definition.
Database-level schemas — namespaces and organisation
In PostgreSQL, a database can contain multiple schemas. Each schema is a namespace — tables with the same name can exist in different schemas without conflict. The default schema is called public. Most simple applications use only the public schema. Data engineering platforms use multiple schemas deliberately to organise their layers.
-- A Snowflake / PostgreSQL data warehouse organised with schemas:
CREATE SCHEMA landing; -- raw files loaded from sources, not cleaned
CREATE SCHEMA bronze; -- partitioned raw data, Parquet format
CREATE SCHEMA silver; -- cleaned, deduplicated, validated
CREATE SCHEMA gold; -- aggregated, business-ready metrics
CREATE SCHEMA staging; -- temporary tables for in-progress transformations
CREATE SCHEMA audit; -- data quality logs, pipeline run metadata
-- Tables in different schemas can have the same name without conflict:
silver.orders ← cleaned orders, one row per valid order
gold.orders ← does NOT exist here — gold has aggregated metrics
staging.orders ← temporary version while transformation runs
-- Referencing a table with its full qualified name:
SELECT * FROM silver.orders WHERE created_at >= '2026-03-01';
-- Setting the search path so you don't need full qualification:
SET search_path = silver, gold, public;
SELECT * FROM orders; -- PostgreSQL looks in silver first, then gold
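The search_path example above is PostgreSQL-specific, but the namespace idea can be sketched with SQLite's ATTACH, where each attached database behaves like a schema. A minimal, runnable sketch (the silver/staging names mirror the layers above; everything else is illustrative):

```python
import sqlite3

# Each ATTACHed in-memory database acts as a separate namespace,
# so two tables named "orders" can coexist without conflict.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH ':memory:' AS silver")
conn.execute("ATTACH ':memory:' AS staging")

conn.execute("CREATE TABLE silver.orders (order_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE staging.orders (order_id INTEGER, amount REAL)")

conn.execute("INSERT INTO silver.orders VALUES (1, 320.0)")

# Qualified names pick the namespace explicitly:
silver_count = conn.execute("SELECT COUNT(*) FROM silver.orders").fetchone()[0]
staging_count = conn.execute("SELECT COUNT(*) FROM staging.orders").fetchone()[0]
print(silver_count, staging_count)  # 1 0
```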
-- Why schema separation matters for data engineering:
dbt models write to specific schemas defined in dbt_project.yml
Access control is granted at schema level (GRANT USAGE ON SCHEMA gold TO analyst_role)
Different retention policies per schema (landing: 30 days, gold: permanent)
Pipeline monitoring groups metrics by schema to track each layer's health
Table-level schema — the structure definition
At the table level, a schema is the complete definition of what that table contains: every column's name, data type, nullability, default value, and any constraints. This definition is enforced by the database on every write. Understanding exactly what goes into a table schema — and why each element matters — is what separates a data engineer who builds reliable pipelines from one who builds fragile ones.
CREATE TABLE silver.orders (
-- PRIMARY KEY: unique identifier for this record
order_id BIGINT NOT NULL,
-- FOREIGN KEYS: references to related tables
customer_id BIGINT NOT NULL,
restaurant_id INTEGER NOT NULL,
-- BUSINESS DATA columns
order_amount DECIMAL(10, 2) NOT NULL,
delivery_fee DECIMAL(6, 2) NOT NULL DEFAULT 0.00,
discount_amount DECIMAL(8, 2) NOT NULL DEFAULT 0.00,
-- CATEGORICAL column with value constraint
status VARCHAR(20) NOT NULL
DEFAULT 'placed'
CHECK (status IN ('placed','confirmed','preparing',
'picked_up','delivered','cancelled')),
-- OPTIONAL columns (nullable) — not every order has these
promo_code VARCHAR(50) NULL,
special_instructions TEXT NULL,
-- AUDIT columns — always present in production tables
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
ingested_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
pipeline_run_id UUID NOT NULL,
-- CONSTRAINTS defined inline
CONSTRAINT pk_orders PRIMARY KEY (order_id),
CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id)
REFERENCES silver.customers(customer_id)
ON DELETE RESTRICT,
CONSTRAINT fk_orders_restaurant FOREIGN KEY (restaurant_id)
REFERENCES silver.restaurants(restaurant_id)
ON DELETE RESTRICT,
CONSTRAINT chk_order_amount CHECK (order_amount > 0),
CONSTRAINT chk_delivery_fee CHECK (delivery_fee >= 0)
);
-- Each element answers a question:
-- BIGINT vs INTEGER → how large can this ID grow?
-- NOT NULL vs NULL → is this field always present or sometimes missing?
-- DEFAULT → what value if nothing is provided?
-- CHECK → what values are valid for this field?
-- FOREIGN KEY → what table does this reference?
-- ON DELETE RESTRICT → what happens if the referenced row is deleted?
Every production table should have created_at and updated_at timestamps. Add ingested_at and pipeline_run_id to every table in your data platform. These columns are what allow you to answer "when was this record last modified?" and "which pipeline run produced this data?" — questions you will need to answer during every debugging session.
Data Types — Why Every Column's Type Is a Real Decision
Choosing a data type is not a formality — it is a decision that affects storage cost, query performance, the range of valid values, the precision of calculations, and what happens when values are compared or sorted. Wrong type choices produce subtle bugs that are hard to find and expensive to fix once data is in production.
Numeric types — the most common source of silent bugs
INTEGER TYPES:
SMALLINT 2 bytes -32,768 to 32,767
Use: age, small counters, status codes
Danger: auto-increment IDs that might exceed 32k
INTEGER 4 bytes -2,147,483,648 to 2,147,483,647 (~2.1 billion)
Use: most IDs, counts, quantities
Danger: Swiggy order IDs exceeded 2B — use BIGINT
BIGINT 8 bytes -9.2 quintillion to 9.2 quintillion
Use: all auto-increment primary keys, timestamps
as milliseconds, transaction IDs at scale
Safe default for any ID column
FLOATING POINT TYPES (approximate — never use for money):
REAL 4 bytes ~7 significant decimal digits
DOUBLE 8 bytes ~15 significant decimal digits
Use: scientific measurements, ML feature values
NEVER use for: money, financial amounts, percentages
Reason: 0.1 + 0.2 = 0.30000000000000004 in IEEE 754
EXACT NUMERIC TYPES (use for money):
DECIMAL(p, s) / NUMERIC(p, s)
p = total digits, s = digits after decimal point
DECIMAL(10, 2) → up to 99,999,999.99
DECIMAL(15, 4) → up to 99,999,999,999.9999
Use: ALL monetary values, financial calculations,
tax amounts, exchange rates
Exact arithmetic — 0.1 + 0.2 = 0.3 exactly
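The FLOAT vs DECIMAL difference is easy to see in Python, whose `float` is the same IEEE 754 double that databases use and whose `decimal.Decimal` performs exact decimal arithmetic like DECIMAL(p, s). A small sketch (the 349.99 price is illustrative):

```python
from decimal import Decimal

# Floats cannot represent 0.1 or 0.2 exactly in IEEE 754 binary:
float_sum = 0.1 + 0.2
exact_sum = Decimal("0.1") + Decimal("0.2")
print(float_sum)  # 0.30000000000000004
print(exact_sum)  # 0.3

# The per-value error compounds across aggregations; with exact
# decimal arithmetic, the total stays exact:
exact_total = sum([Decimal("349.99")] * 10_000)
float_total = sum([349.99] * 10_000)   # may drift from the exact total
print(exact_total)  # 3499900.00
```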
SERIAL / AUTO-INCREMENT (PostgreSQL):
SERIAL = INTEGER with auto-increment sequence
BIGSERIAL = BIGINT with auto-increment sequence
Use BIGSERIAL for all primary keys by default
Note: SERIAL is PostgreSQL-specific; SQL standard uses
GENERATED ALWAYS AS IDENTITY
Text types — size matters more than you think
VARCHAR(n) Variable length, max n characters
Storage: only uses space for actual content + overhead
Use: names (VARCHAR(200)), emails (VARCHAR(320)),
phone numbers (VARCHAR(20)), status codes (VARCHAR(20))
Always specify n — unbounded VARCHAR is a maintenance trap
CHAR(n) Fixed length, always n characters (padded with spaces)
Use: country codes (CHAR(2)), currency codes (CHAR(3))
Avoid: most use cases — VARCHAR is almost always better
TEXT Variable length, no limit
Storage: same as VARCHAR in PostgreSQL internally
Use: free-form text (notes, descriptions, review text),
JSON stored as text, long URLs
Never use for: fields where length should be bounded
Key rules:
Email: VARCHAR(320) — 64-char local part + @ + 255-char domain (RFC 5321's path limit makes 254 the practical max)
Phone: VARCHAR(20) — international format + formatting chars
Country: CHAR(2) — ISO 3166-1 alpha-2 always exactly 2 chars
URL: TEXT — can be arbitrarily long
Status: VARCHAR(20) — short controlled vocabulary
Review: TEXT — unbounded free text
UUID: UUID or CHAR(36) — always exactly 36 chars with dashes
The VARCHAR vs TEXT debate in PostgreSQL:
In PostgreSQL, VARCHAR(n) and TEXT have identical performance.
The only difference is VARCHAR(n) enforces a length limit.
Always prefer VARCHAR(n) with a reasonable limit over TEXT
for fields where length should be controlled.
Date and time types — the most mishandled category
DATE Calendar date only: 2026-03-17
Storage: 4 bytes
Use: birthdates, event dates, report dates
No timezone — just a calendar date
TIME Time of day only: 20:14:32
Rarely used alone. Usually combined with date.
TIMESTAMP Date + time without timezone: 2026-03-17 20:14:32
Storage: 8 bytes
Use: NEVER for production data that spans timezones
Problem: "20:14:32" means nothing without knowing the timezone
A Mumbai user at 8:14 PM is 14:44 UTC
A London user at 8:14 PM is 20:14 UTC
Stored without timezone, these are indistinguishable
TIMESTAMPTZ Date + time WITH timezone: 2026-03-17T20:14:32+05:30
(TIMESTAMP WITH TIME ZONE)
Storage: 8 bytes (PostgreSQL stores internally as UTC)
Use: ALWAYS use for event timestamps in production
Why: unambiguous, converts to local time at query time,
correct cross-timezone comparisons and sorting
THE RULE: always use TIMESTAMPTZ for event timestamps.
Use DATE when you genuinely only need a calendar date
(e.g., birth_date, report_date, expiry_date).
INTERVAL Duration: '3 days', '2 hours 30 minutes', '1 year'
Use: date arithmetic, scheduling intervals
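The Mumbai/London example above can be checked with Python's stdlib zoneinfo (3.9+, requires a system timezone database). The conversion to UTC mirrors what TIMESTAMPTZ effectively stores; the specific date is illustrative:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# The same wall-clock time, 8:14 PM on 17 March 2026, in two cities:
mumbai = datetime(2026, 3, 17, 20, 14, 32, tzinfo=ZoneInfo("Asia/Kolkata"))
london = datetime(2026, 3, 17, 20, 14, 32, tzinfo=ZoneInfo("Europe/London"))

# Normalised to UTC, which is what TIMESTAMPTZ stores internally:
print(mumbai.astimezone(timezone.utc))  # 2026-03-17 14:44:32+00:00
print(london.astimezone(timezone.utc))  # 2026-03-17 20:14:32+00:00 (GMT in mid-March)

# Comparisons between timezone-aware datetimes are correct:
print(mumbai < london)  # True: 14:44 UTC is earlier than 20:14 UTC
```

Without the tzinfo, the two values would be indistinguishable, which is exactly the TIMESTAMP-without-timezone problem.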
Common bugs from TIMESTAMP without timezone:
- Orders from Indian users in the early hours (midnight to 5:30 AM IST) appear
to be from "the previous day" when analysed by a European team in UTC
- Partition pruning fails: WHERE created_at >= '2026-03-17'
evaluates differently in IST vs UTC context
- Daylight saving time transitions produce duplicate or missing hours
Special types worth knowing
UUID Universally Unique Identifier
Format: 550e8400-e29b-41d4-a716-446655440000
Storage: 16 bytes (as UUID type) or 36 bytes (as CHAR(36))
Use: globally unique IDs across distributed systems,
order IDs, payment IDs, user IDs at large scale
Advantage: generated independently by any system,
no central sequence required
Disadvantage: random UUIDs create index fragmentation
— use UUIDv7 (time-ordered) for better performance
BOOLEAN TRUE / FALSE / NULL
Storage: 1 byte
Use: flags, feature toggles, yes/no fields
Note: three-valued logic — NULL means "unknown",
not FALSE. WHERE is_premium = TRUE excludes NULLs.
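The three-valued logic note can be demonstrated with SQLite (which stores booleans as 0/1); the users table and its rows are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_premium BOOLEAN)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("a", 1), ("b", 0), ("c", None)])  # NULL = unknown

# WHERE is_premium = TRUE matches only the TRUE row, not the NULL one:
premium = conn.execute(
    "SELECT COUNT(*) FROM users WHERE is_premium = 1").fetchone()[0]
# The negation ALSO skips the NULL row, because NULL is not FALSE:
not_premium = conn.execute(
    "SELECT COUNT(*) FROM users WHERE is_premium <> 1").fetchone()[0]
# Finding the unknowns requires IS NULL:
unknown = conn.execute(
    "SELECT COUNT(*) FROM users WHERE is_premium IS NULL").fetchone()[0]
print(premium, not_premium, unknown)  # 1 1 1
```

The NULL row falls out of both the positive and the negative filter, which is how rows silently vanish from metrics.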
JSONB Binary JSON, indexed
Storage: variable
Use: semi-structured data within a relational table
(metadata, config, variable attributes per record)
Supports: GIN indexes for fast key/value queries
vs JSON: JSONB parses and stores in binary (faster queries),
JSON stores original text (preserves whitespace/key order)
Rule: always prefer JSONB over JSON in PostgreSQL
ARRAY Column containing an array of a specific type
Example: tags TEXT[], phone_numbers VARCHAR(20)[]
Use: tags, labels, multi-value attributes
Caution: arrays break normalisation — usually a sign
you need a separate table with a foreign key
ENUM User-defined type with a fixed set of values
Example: CREATE TYPE order_status AS ENUM ('placed', 'delivered')
Advantage: storage-efficient, enforces values at type level
Disadvantage: adding values requires ALTER TYPE (can be slow)
Recommendation: use VARCHAR + CHECK constraint instead for flexibility
Keys — The Identifiers and Relationships That Hold Data Together
Keys are the mechanism by which a relational database enforces identity and relationships. Without keys, a database is just a collection of independent rows with no way to identify a specific record or connect related records across tables. Keys are what give the "relational" in relational database its meaning.
Primary Key — the identity of a row
A primary key is the column or combination of columns that uniquely identifies each row in a table. Every table should have exactly one primary key. The database enforces two guarantees on a primary key: uniqueness (no two rows can have the same primary key value) and NOT NULL (a row cannot exist without a primary key value). The database automatically creates a unique index on the primary key.
SURROGATE KEY (synthetic, auto-generated):
order_id BIGSERIAL PRIMARY KEY
-- or --
order_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY
-- or --
order_id UUID DEFAULT gen_random_uuid() PRIMARY KEY
Advantages:
→ Stable — never changes even if business data changes
→ Simple — single column, always a known type
→ Efficient — integer PKs are smaller and faster to index than UUIDs
Disadvantages:
→ Meaningless — tells you nothing about the row
→ Requires a separate business key for deduplication
NATURAL KEY (from the domain):
email VARCHAR(320) PRIMARY KEY -- for a users table
isbn CHAR(13) PRIMARY KEY -- for a books table
Advantages:
→ Meaningful — also serves as a lookup key
→ No extra column needed
Disadvantages:
→ Can change (user changes email → cascading update everywhere)
→ Must be truly unique and immutable in the real world
→ Longer values create larger, slower indexes
COMPOSITE KEY (multiple columns together):
PRIMARY KEY (order_id, item_sequence) -- for an order_items table
PRIMARY KEY (student_id, course_id) -- for an enrollment table
Use when: no single column uniquely identifies a row, but a
combination does. Common in junction/association tables.
RECOMMENDATION: use BIGSERIAL or UUID as surrogate PKs for almost
everything. Natural keys seem appealing but cause pain when data
changes. Keep natural keys as UNIQUE constraints, not the PK.
Foreign Key — the relationship between tables
A foreign key is a column (or set of columns) in one table that references the primary key of another table. It enforces referential integrity: a row in the child table cannot reference a primary key value that does not exist in the parent table. You cannot have an order for a customer who does not exist.
-- Child table with foreign key references to two parent tables:
CREATE TABLE silver.orders (
order_id BIGINT PRIMARY KEY,
customer_id BIGINT NOT NULL,
restaurant_id INTEGER NOT NULL,
CONSTRAINT fk_customer FOREIGN KEY (customer_id)
REFERENCES silver.customers(customer_id)
ON DELETE RESTRICT -- prevent deleting customer with orders
ON UPDATE CASCADE, -- if customer_id changes, update here too
CONSTRAINT fk_restaurant FOREIGN KEY (restaurant_id)
REFERENCES silver.restaurants(restaurant_id)
ON DELETE RESTRICT
ON UPDATE CASCADE
);
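The same RESTRICT behaviour can be exercised with SQLite from Python (SQLite disables FK enforcement per connection by default, so the PRAGMA is required). A minimal sketch with illustrative table names and IDs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite

conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customers(customer_id) ON DELETE RESTRICT
    )""")
conn.execute("INSERT INTO customers VALUES (4201938)")
conn.execute("INSERT INTO orders VALUES (9284751, 4201938)")

# An order for a customer that does not exist is rejected:
try:
    conn.execute("INSERT INTO orders VALUES (9284752, 999)")
    orphan_blocked = False
except sqlite3.IntegrityError:
    orphan_blocked = True

# Deleting a customer who still has orders is rejected:
try:
    conn.execute("DELETE FROM customers WHERE customer_id = 4201938")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True

print(orphan_blocked, delete_blocked)  # True True
```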
ON DELETE options — what happens when the parent row is deleted:
RESTRICT → PREVENT the delete if child rows reference it (safest for DE)
CASCADE → DELETE all child rows when parent is deleted (dangerous — use carefully)
SET NULL → Set foreign key column to NULL when parent deleted (requires nullable FK)
SET DEFAULT → Set to default value (rare)
NO ACTION → Like RESTRICT, but the check can be deferred to end of transaction if the constraint is DEFERRABLE
ON UPDATE options — what happens when the parent primary key changes:
CASCADE → UPDATE the foreign key value in all child rows automatically
RESTRICT → PREVENT the update if child rows reference it
(Same other options as ON DELETE)
REAL IMPACT: what FK violations look like without enforcement:
Without FK constraint on customer_id:
Customer 4201938 gets deleted (GDPR request)
All their orders still exist with customer_id = 4201938
JOIN query: SELECT o.*, c.name FROM orders o JOIN customers c
ON o.customer_id = c.id
→ All of customer 4201938's orders vanish from the result
(no matching customer row → JOIN returns nothing)
→ Metrics show fewer orders than actually exist — silent data loss
With FK constraint ON DELETE RESTRICT:
Attempt to delete customer 4201938 → ERROR: violates foreign key
Constraint forces you to handle the orders first (archive or anonymise)
before the customer record can be removed
FK in data warehouses (Snowflake, BigQuery):
FKs are defined but NOT ENFORCED in most warehouses.
They are documentation and query optimizer hints, not runtime checks.
Data quality must be enforced by pipeline logic instead.
Always add dbt tests for FK relationships in warehouse models.
Unique Key — uniqueness without being the primary key
A unique constraint enforces that no two rows in a table have the same value in the constrained column (or combination of columns). Unlike a primary key, a table can have multiple unique constraints, and unique constraint columns can be nullable. Because NULLs compare as distinct, standard SQL and PostgreSQL permit multiple NULLs in a unique column; some databases, such as SQL Server, allow only one.
-- Table with multiple unique constraints:
CREATE TABLE silver.customers (
customer_id BIGINT PRIMARY KEY, -- PK: auto-increment
email VARCHAR(320) NOT NULL UNIQUE, -- business key: email must be unique
phone VARCHAR(20) NULL UNIQUE, -- optional but unique if present
external_id VARCHAR(50) NOT NULL UNIQUE -- ID from source system (Shopify etc.)
);
-- Composite unique constraint:
CREATE TABLE silver.restaurant_menus (
menu_id BIGINT PRIMARY KEY,
restaurant_id INTEGER NOT NULL,
menu_name VARCHAR(100) NOT NULL,
UNIQUE (restaurant_id, menu_name) -- a restaurant can't have two menus with the same name
);
USE CASES for unique constraints:
✓ email address in a users table
✓ external_id from source system (prevents duplicate ingestion)
✓ slug in a CMS (URL must be unique)
✓ (user_id, product_id) in a wishlist table (one wishlist entry per product)
UNIQUE vs PRIMARY KEY:
Primary key: exactly one per table, cannot be NULL
Unique: multiple allowed per table, can be NULL (PostgreSQL permits multiple NULLs)
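A duplicate external_id from a re-run ingestion is rejected rather than silently inserted; this can be sketched with SQLite (the external_id value is illustrative; ON CONFLICT requires SQLite 3.24+, which ships with modern Python):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        external_id TEXT NOT NULL UNIQUE
    )""")
conn.execute("INSERT INTO customers (external_id) VALUES ('shopify-8817')")

# Re-running the same ingestion fails loudly instead of duplicating:
try:
    conn.execute("INSERT INTO customers (external_id) VALUES ('shopify-8817')")
    duplicate_inserted = True
except sqlite3.IntegrityError:
    duplicate_inserted = False

# Idempotent re-ingestion: skip rows that are already present.
conn.execute("""INSERT INTO customers (external_id) VALUES ('shopify-8817')
                ON CONFLICT (external_id) DO NOTHING""")
rows = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(duplicate_inserted, rows)  # False 1
```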
UNIQUE in data engineering pipelines:
External IDs from source systems should always have a UNIQUE constraint.
Without it, a pipeline bug that re-inserts the same record creates duplicates
that are invisible until someone notices metrics are inflated.
With the constraint, the duplicate insert fails visibly with an error.
Natural key vs surrogate key — a deeper look
The choice between natural keys and surrogate keys is one of the most debated schema design questions. Here is the practical answer from a data engineering perspective.
| Dimension | Natural Key | Surrogate Key |
|---|---|---|
| Definition | A column that exists in the real world and uniquely identifies the entity (email, ISBN, PAN number) | A generated identifier with no meaning outside the database (auto-increment integer, UUID) |
| Stability | Can change — users change email, companies rebrand | Never changes — it was generated and has no real-world meaning to update |
| Join performance | Slower — larger values (strings) create bigger, slower indexes | Faster — integers are small and fast to index and compare |
| Readability | Self-documenting — you can see what the row represents from the key | Opaque — "47291" means nothing without looking up the row |
| Deduplication | Use as deduplication key — same email = same customer | Cannot deduplicate on its own — need a separate business key column |
| Recommended for | UNIQUE constraint as business key — alongside a surrogate PK | PRIMARY KEY on all tables — stable, small, fast |
The practical recommendation used by most data engineering teams: use a surrogate key (BIGSERIAL or UUID) as the primary key for all tables, and add a UNIQUE constraint on the natural business key column. This gives you the stability and performance of a surrogate PK while still enforcing the uniqueness of the natural key.
Constraints — The Rules That Keep Data Honest
A constraint is a rule enforced by the database on every write operation. When a constraint is violated, the database rejects the operation entirely — the row is not written and the transaction is rolled back. Constraints are the database's built-in data quality layer. They are what makes a relational database more trustworthy than a CSV file, which accepts anything.
There are five types of constraints. Every data engineer should know all five, understand what each one protects against, and be able to identify which constraint is appropriate for each data quality need.
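All five constraint types can be packed into one small table and exercised with SQLite; each rejected insert below violates exactly one rule. Table and column names are illustrative, and the amount column is REAL only because SQLite has no DECIMAL (in PostgreSQL it would be DECIMAL(10, 2)):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,                    -- 1. PRIMARY KEY
        customer_id  INTEGER NOT NULL                        -- 2. NOT NULL
            REFERENCES customers(customer_id),               -- 3. FOREIGN KEY
        external_id  TEXT NOT NULL UNIQUE,                   -- 4. UNIQUE
        order_amount REAL NOT NULL CHECK (order_amount > 0)  -- 5. CHECK
    )""")
conn.execute("INSERT INTO customers VALUES (1)")

rejected = []
for row in [
    (101, None, "a-1", 99.0),   # NULL customer_id  -> NOT NULL violation
    (102, 7,    "a-2", 99.0),   # customer 7 absent -> FOREIGN KEY violation
    (103, 1,    "a-3", -5.0),   # negative amount   -> CHECK violation
]:
    try:
        conn.execute("INSERT INTO orders VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row[0])

print(rejected)  # [101, 102, 103] — every bad row rejected at write time
```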
Normalisation — Organising Data to Eliminate Redundancy
Normalisation is the process of organising a database to reduce data redundancy and improve data integrity. It is done by decomposing tables into smaller, more focused tables and defining relationships between them with foreign keys. The goal is to store each fact in exactly one place.
As a data engineer you need to understand normalisation because: the operational databases you ingest from are normalised (which is why you need JOINs to get complete data); your Silver layer should be normalised for integrity; and your Gold layer deliberately denormalises for query performance. Knowing when to normalise and when to denormalise is a core skill.
The normal forms — intuitive explanation
UNNORMALISED TABLE (everything in one table):
order_id | customer_name | customer_email | restaurant_name | items
─────────────────────────────────────────────────────────────────────────
9284751 | Priya Sharma | priya@example.com | Punjabi Dhaba | Butter Chicken, Naan
9284752 | Priya Sharma | priya@example.com | Spice Garden | Masala Dosa
9284753 | Rahul Verma | rahul@example.com | Punjabi Dhaba | Dal Makhani, Rice
Problems:
→ "Priya Sharma" and "priya@example.com" stored 2 times
If her email changes, must update 2 rows (or more if she orders again)
If one update is missed → inconsistency
→ "Punjabi Dhaba" stored 2 times — same duplication problem
→ "items" column contains multiple values — cannot query individual items
FIRST NORMAL FORM (1NF):
Rule: each column contains a single atomic value (no lists/arrays in a cell)
Fix the "items" column by creating one row per item:
order_id | customer_name | restaurant_name | item_name
9284751 | Priya Sharma | Punjabi Dhaba | Butter Chicken
9284751 | Priya Sharma | Punjabi Dhaba | Naan
Better but still has customer/restaurant duplication.
SECOND NORMAL FORM (2NF):
Rule: every non-key column must depend on the ENTIRE primary key,
not just part of it. (Relevant for composite PKs)
In our order_items table with PK (order_id, item_name):
customer_name depends only on order_id, not on item_name
→ move customer_name to the orders table (depends on order_id only)
THIRD NORMAL FORM (3NF) — the target for OLTP:
Rule: every non-key column must depend DIRECTLY on the primary key,
not on another non-key column (no transitive dependencies)
FULLY NORMALISED DESIGN (3NF):
customers: customer_id | name | email
4201938 | Priya Sharma | priya@example.com
1092847 | Rahul Verma | rahul@example.com
restaurants: restaurant_id | name
7823 | Punjabi Dhaba
2341 | Spice Garden
orders: order_id | customer_id | restaurant_id | created_at
9284751 | 4201938 | 7823 | 2026-03-17 20:14
9284752 | 4201938 | 2341 | 2026-03-17 21:05
9284753 | 1092847 | 7823 | 2026-03-17 20:45
order_items: order_id | item_name | quantity | unit_price
9284751 | Butter Chicken | 1 | 320.00
9284751 | Naan | 2 | 30.00
Now: Priya's email is stored exactly ONCE → update in one place
"Punjabi Dhaba" is stored exactly ONCE → update in one place
Each table represents one entity, each column depends on its PK
Denormalisation — when to deliberately undo normalisation
Normalisation is the right design for operational databases. For analytical databases, it is often the wrong choice. Normalised schemas require JOINs to reconstruct data, and JOINs on large tables in analytical queries are expensive. The Gold layer in a data warehouse is deliberately denormalised — data is pre-joined so analysts can query with simple SELECT statements rather than multi-table JOINs.
NORMALISED (Silver layer — correct for integrity):
To answer "total revenue by restaurant category":
SELECT r.category, SUM(o.order_amount)
FROM silver.orders o
JOIN silver.restaurants r ON o.restaurant_id = r.id
GROUP BY r.category;
→ Requires JOIN on every query execution
DENORMALISED (Gold layer — correct for performance):
CREATE TABLE gold.daily_order_metrics AS
SELECT
DATE(o.created_at) AS order_date,
r.name AS restaurant_name,
r.category AS restaurant_category,
r.city AS city,
COUNT(*) AS order_count,
SUM(o.order_amount) AS total_revenue,
AVG(o.order_amount) AS avg_order_value
FROM silver.orders o
JOIN silver.restaurants r ON o.restaurant_id = r.id
GROUP BY 1, 2, 3, 4;
Analyst query: SELECT restaurant_category, SUM(total_revenue)
FROM gold.daily_order_metrics
WHERE order_date >= '2026-01-01'
GROUP BY restaurant_category;
→ No JOIN. Just reads from one pre-aggregated table. Fast.
Trade-off: restaurant_category is now stored once per order in the
Gold table. If you need to update the category for all
restaurants in a city, you must rebuild the Gold table.
This is correct — Gold tables are rebuilt on a schedule.
The cost is acceptable for the analytical performance gained.
Schema Design Patterns Every Data Engineer Must Know
The audit columns pattern — never skip these
Every table in every layer of your data platform should have a standard set of audit columns. These columns are not part of the business data — they are metadata about when and how each row was created or modified. They are what make debugging possible.
-- Minimum audit columns for OLTP source tables:
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
-- Additional audit columns for data platform tables:
ingested_at TIMESTAMPTZ NOT NULL DEFAULT NOW() -- when pipeline ran
pipeline_run_id UUID NOT NULL -- which specific run
source_system VARCHAR(50) NOT NULL -- which source this came from
record_hash CHAR(64) NULL -- SHA-256 of key fields for dedup
-- Why each one matters:
created_at: "When did this event happen in the real world?"
updated_at: "When was this record last modified in the source?"
ingested_at: "When did our pipeline process this row?"
pipeline_run_id: "Which pipeline run produced this row?"
When a bug is found: re-run that specific run with a fix
source_system: "Which of our 15 data sources did this come from?"
record_hash: "Is this an exact duplicate of a row we already have?"
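A record_hash like the one above can be computed with hashlib: hash the identifying fields in a fixed order with a separator so the same logical record always produces the same digest. The field names and values below are illustrative:

```python
import hashlib

def record_hash(order_id, customer_id, order_amount, created_at):
    """SHA-256 over the key fields, joined in a fixed order with a
    separator, so identical records always hash identically."""
    key = f"{order_id}|{customer_id}|{order_amount}|{created_at}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

h1 = record_hash(9284751, 4201938, "349.99", "2026-03-17T20:14:32+05:30")
h2 = record_hash(9284751, 4201938, "349.99", "2026-03-17T20:14:32+05:30")
h3 = record_hash(9284751, 4201938, "350.00", "2026-03-17T20:14:32+05:30")

print(len(h1))   # 64 hex chars — fits the CHAR(64) column
print(h1 == h2)  # True  — exact duplicate detected
print(h1 == h3)  # False — any field change produces a different hash
```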
The difference between created_at and ingested_at:
An order was placed at 2026-03-17 20:14:32 (created_at)
Our pipeline processed it at 2026-03-18 02:14:47 (ingested_at)
These are different by 6 hours — both are useful for different analysis.
Mixing them up produces wrong time-based metrics.
The soft delete pattern — never actually delete rows
In most data platforms, rows should never be physically deleted. Instead, add a deleted_at column and a is_deleted flag. Physical deletes break pipeline reruns (data that existed is gone), break historical analysis (the past is now different from what it was), and are impossible to reverse. Soft deletes preserve the history while making deleted rows queryable and filterable.
-- Add to any table where rows "go away":
is_deleted BOOLEAN NOT NULL DEFAULT FALSE
deleted_at TIMESTAMPTZ NULL
-- Mark as deleted instead of DELETE:
UPDATE silver.orders
SET is_deleted = TRUE,
deleted_at = NOW()
WHERE order_id = 9284751;
-- All queries filter to active records:
SELECT * FROM silver.orders WHERE is_deleted = FALSE;
-- Historical analysis still works:
SELECT COUNT(*) FROM silver.orders
WHERE created_at >= '2026-01-01'
AND is_deleted = FALSE; -- current count (excluding deleted)
SELECT COUNT(*) FROM silver.orders
WHERE created_at >= '2026-01-01'; -- all records ever created
-- Create a view that hides deleted rows:
CREATE VIEW silver.active_orders AS
SELECT * FROM silver.orders WHERE is_deleted = FALSE;
-- Analysts always query the view, never the raw table directly.
The versioned schema pattern — for tables that change often
When a source system frequently changes its schema — adding columns, changing types, renaming fields — maintain a versioned schema approach. Instead of breaking your pipeline every time the source changes, store the schema version alongside each batch of data and handle multiple schema versions explicitly in your transformation code.
-- Landing zone table with schema version tracking:
CREATE TABLE landing.orders_raw (
batch_id UUID NOT NULL,
schema_version VARCHAR(10) NOT NULL, -- "v1", "v2", "v3"
received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
raw_payload JSONB NOT NULL -- store as JSON, parse in Silver
);
-- Transformation handles each version:
-- dbt model: silver/orders.sql
SELECT
batch_id,
received_at,
CASE schema_version
WHEN 'v1' THEN (raw_payload->>'orderId')::BIGINT
WHEN 'v2' THEN (raw_payload->>'order_id')::BIGINT -- renamed in v2
WHEN 'v3' THEN (raw_payload->>'order_id')::BIGINT -- same as v2
END AS order_id,
CASE schema_version
WHEN 'v1' THEN NULL -- didn't exist in v1
WHEN 'v2' THEN (raw_payload->>'delivery_fee')::DECIMAL(6,2)
WHEN 'v3' THEN (raw_payload->>'delivery_fee')::DECIMAL(6,2)
END AS delivery_fee,
...
FROM landing.orders_raw;
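The same version dispatch can be sketched in Python: one parser per schema version, keyed by the version tag stored with each batch. The field names (orderId vs order_id) follow the example above; everything else is illustrative:

```python
import json

# One parser per source schema version; a new version is one new entry.
PARSERS = {
    "v1": lambda p: {"order_id": int(p["orderId"]), "delivery_fee": None},
    "v2": lambda p: {"order_id": int(p["order_id"]),
                     "delivery_fee": float(p["delivery_fee"])},
}
PARSERS["v3"] = PARSERS["v2"]  # v3 kept the v2 field names

def parse_order(schema_version, raw_payload):
    """Dispatch a raw JSON payload to the parser for its schema version."""
    return PARSERS[schema_version](json.loads(raw_payload))

old = parse_order("v1", '{"orderId": "9284751"}')
new = parse_order("v2", '{"order_id": "9284752", "delivery_fee": "25.00"}')
print(old)  # {'order_id': 9284751, 'delivery_fee': None}
print(new)  # {'order_id': 9284752, 'delivery_fee': 25.0}
```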
Benefits:
→ New source schema version? Add a new CASE branch.
→ Old data is still processable with its original schema version.
→ Schema change history is explicit and auditable.
Reviewing a Schema and Finding Four Silent Data Quality Risks
Your manager asks you to review the schema for the orders table in your new company's staging database and flag any data quality risks. Here is the schema you find:
CREATE TABLE staging.orders (
id INT,
cust VARCHAR,
rest_id INT,
amount FLOAT,
status VARCHAR,
created TIMESTAMP,
updated TIMESTAMP
);
Problem 1 — No primary key: The id column has no PRIMARY KEY constraint. Two rows can have the same id. Any CDC pipeline that receives an UPDATE for a specific order ID cannot reliably identify which row to update if duplicates exist. Any JOIN on this table produces fan-out if ids are duplicated.
Problem 2 — FLOAT for money: amount FLOAT will silently accumulate floating point precision errors in financial calculations. ₹349.99 stored as FLOAT may be retrieved as 349.99000000000001. Aggregation of thousands of such values will produce reconciliation failures against the payment processor's exact totals. Must be DECIMAL(10,2).
Problem 3 — TIMESTAMP without timezone: Both created and updated are TIMESTAMP (no timezone). This company serves customers across India, and the Bangalore office is in IST (+5:30). When a report counts "orders placed today" using WHERE DATE(created) = CURRENT_DATE, the result changes depending on whether the query runs in IST or UTC context. Early-morning orders (midnight–5:30 AM IST) land on the previous date in UTC analysis. Must be TIMESTAMPTZ.
Problem 4 — No NOT NULL constraints: No column has a NOT NULL constraint. A pipeline bug that omits customer IDs will insert rows with NULL cust silently. Revenue calculations that JOIN to the customers table will silently exclude these orders (JOIN returns nothing for NULL foreign keys). Metrics will be understated and nobody will know why.
Your rewritten schema:
CREATE TABLE silver.orders (
order_id BIGINT NOT NULL,
customer_id BIGINT NOT NULL,
restaurant_id INTEGER NOT NULL,
order_amount DECIMAL(10, 2) NOT NULL CHECK (order_amount > 0),
status VARCHAR(20) NOT NULL
CHECK (status IN ('placed','confirmed','preparing',
'picked_up','delivered','cancelled')),
created_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ NOT NULL,
ingested_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
pipeline_run_id UUID NOT NULL,
CONSTRAINT pk_orders PRIMARY KEY (order_id),
CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id)
REFERENCES silver.customers(customer_id)
ON DELETE RESTRICT,
CONSTRAINT fk_orders_restaurant FOREIGN KEY (restaurant_id)
REFERENCES silver.restaurants(restaurant_id)
ON DELETE RESTRICT
);
Changes made: added PRIMARY KEY, changed FLOAT to DECIMAL, changed TIMESTAMP to TIMESTAMPTZ, added NOT NULL to all required columns, added CHECK constraints on status and order_amount, added foreign keys with ON DELETE RESTRICT, renamed columns to be descriptive and consistent, added audit columns. Every one of these changes prevents a real data quality bug.
🎯 Key Takeaways
- ✓ A schema at the database level is a namespace that groups related tables. Data platforms use schemas to organise layers: landing, bronze, silver, gold. Schema-level access control (GRANT ON SCHEMA) is how you control who can access each layer.
- ✓ Every column's data type is a real decision. BIGINT for all IDs (INTEGER overflows at 2.1 billion). DECIMAL(10,2) for all money (FLOAT accumulates precision errors in financial calculations). TIMESTAMPTZ for all event timestamps (TIMESTAMP without timezone produces wrong results across timezones). VARCHAR(n) with a limit for bounded text.
- ✓ NEVER use FLOAT for monetary values. Floating point arithmetic cannot represent most decimal fractions exactly. 0.1 + 0.2 = 0.30000000000000004 in IEEE 754. Use DECIMAL(p,s) — it performs exact decimal arithmetic. This is not a style preference, it is a correctness requirement for financial data.
- ✓ ALWAYS use TIMESTAMPTZ (TIMESTAMP WITH TIME ZONE) for event timestamps. TIMESTAMP without timezone is ambiguous — "20:14:32" means different things in IST and UTC. TIMESTAMPTZ stores as UTC internally and converts to local time at query time, producing correct results regardless of session timezone.
- ✓ Every table must have a primary key. No exceptions. Without a primary key, rows cannot be uniquely identified, updates may affect the wrong rows, JOINs produce fan-out on duplicates, and CDC updates have no reliable target.
- ✓ Foreign key constraints enforce referential integrity — preventing orphaned records. In OLTP databases they are enforced at runtime. In data warehouses (Snowflake, BigQuery) they are defined but NOT enforced — use dbt relationship tests instead.
- ✓ The five constraint types and what each protects: NOT NULL (prevents missing required values), UNIQUE (prevents duplicates), CHECK (enforces valid values and ranges), FOREIGN KEY (prevents orphaned references), PRIMARY KEY (combines NOT NULL and UNIQUE for the row identifier).
- ✓ Every production table should have audit columns: created_at, updated_at, ingested_at, and pipeline_run_id. These columns answer the two questions you will ask in every debugging session: "when did this happen?" and "which pipeline run produced this data?"
- ✓ Normalisation (3NF) is correct for OLTP source tables and the Silver layer — each fact stored once, relationships via foreign keys. Denormalisation is correct for the Gold layer — pre-joined, wide, flat tables that analysts can query without complex JOINs.
- ✓ Data warehouses define constraints as documentation but do not enforce them. Your pipeline code and dbt tests are the enforcement layer in a warehouse. Always add at minimum: unique tests on primary key columns, not-null tests on required columns, and relationship tests on foreign key columns.