Python · SQL · Web Dev · Java · AI/ML tracks launching soon — your one platform for all of IT
Advanced

3 Real SQL Projects

Three production-grade analytical projects — revenue intelligence, customer lifecycle analysis, and schema design — built step by step with business context and interpretation

80–100 min April 2026
Section 13 · Real-World & Interview
Real-World & Interview · 3 modulesModule 62

// Part 00

How to Work Through These Projects

These are not exercises with a predetermined answer at the bottom. They are structured the same way a senior data analyst actually works: receive a business problem, decompose it into analytical questions, write queries that answer each question, interpret the output, and form a recommendation. Each project has five to seven steps — skip none of them, because the reasoning in each step directly informs the next.

Every project uses the FreshCart dataset already loaded in each playground. The data is real and consistent — every query you run will produce results you can actually interpret. By the end of each project, you should be able to write two or three sentences summarizing what you found and what the business should do about it. That synthesis is the point.

💡 Note
Each playground is independent — changes you make in one do not carry over. But the dataset is the same everywhere, so results will be consistent across steps. Run every query, read the output, and work through the interpretation prompts before moving to the next step.

PROJECT 01 / 03

30 min

Revenue Intelligence Report

Data Analyst @ FreshCart Analytics Team

Business Problem

The Head of Revenue Operations needs to understand what drove FreshCart's performance this period. She needs to know which stores are leading and lagging, which customer segments are most valuable, whether the business is growing or contracting month-over-month, and where concentration risk exists. She has 20 minutes before the leadership review. She needs numbers, not descriptions.

Your Deliverable

A complete revenue breakdown across store, loyalty tier, payment method, and time — with anomaly flags for stores or segments deviating significantly from the mean, and a one-paragraph written summary you could hand to the CEO.

SQL Skills Applied

GROUP BYWindow FunctionsSUM() OVER()LAG()CASE WHENCTEsHAVINGConditional Aggregation
1

Establish the North Star: Baseline KPIs

Before any breakdown, you need one row that anchors everything else. Every percentage in the rest of the report is relative to these totals. If you skip this step, numbers in later slides have no context — is ₹85,000 good or bad? You cannot know without the total.

Write a single query returning: total delivered orders, total revenue, average order value, count of unique customers who ordered, total items sold, and revenue per customer. These are your KPIs.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

How to read this output

Note the cancellation_rate_pct alongside revenue. A high cancellation rate with high total revenue can be misleading — you are measuring orders that completed, not orders that were placed. If cancellation_rate is above 10%, that is worth calling out in the executive summary as a separate concern. Revenue_per_customer is the most important single number here: it tells you the average economic value of acquiring one FreshCart customer.

2

Decompose Revenue by Store — Find the Leaders and Laggards

Revenue concentration by store reveals two things: which stores are driving growth (should receive more investment, inventory priority) and which stores are underperforming relative to their expected contribution (need investigation — staffing problem? Catchment area? Logistics?). You need absolute revenue, order count, average order value, and share of total.

The critical addition: compute each store's deviation from the mean average order value. A store with high revenue but low AOV is processing many small orders. A store with low order count but high AOV has fewer but more valuable customers. These are very different operational profiles requiring different interventions.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

How to read this output

The aov_vs_mean column is the most actionable number on this slide. A store that is -₹150 below the mean AOV is not serving high-value customers — its growth lever is increasing basket size (bundle promotions, minimum order discounts) rather than acquiring more customers. A store that is +₹200 above mean AOV already has high-value customers — its lever is increasing visit frequency. The revenue_share_pct column immediately surfaces concentration risk: if two stores account for 60%+ of revenue, a logistics disruption at either location is a business-level risk.

3

Customer Segment Analysis — Where Revenue Actually Comes From

The Pareto principle in e-commerce: roughly 20% of customers generate 80% of revenue. Your job is to quantify this precisely, identify which loyalty tiers hold that 20%, and compute the economic gap between tiers — which tells you the value of successfully upgrading a Silver customer to Gold.

The metric that matters most here is revenue per customer within each tier, not total revenue. Total revenue is a function of how many customers are in each tier. Revenue per customer tells you the economic profile of one customer in each segment — that is what drives upgrade and retention strategy.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

How to read this output

The revenue_concentration_delta is the key insight column. A positive delta means that tier generates a disproportionately large share of revenue relative to its size — these are your most valuable customers per head. A negative delta means the tier has many customers but each generates relatively little revenue. In most datasets, Platinum and Gold tiers show positive deltas (10–30+ points above their customer share), while Bronze shows a large negative delta. This quantifies exactly how much revenue concentration exists. The upgrade value calculation: if a Silver customer generates ₹X per period and a Gold customer generates ₹Y, each successful tier upgrade is worth ₹(Y−X) per period to the business.

Common mistake

A common mistake here is to look only at total_revenue per tier and conclude that Bronze generates the least value. Bronze often has the most customers — total revenue can be misleading. Always look at revenue_per_customer. A tier with 5 Platinum customers each spending ₹5,000 (₹25,000 total) is more valuable per customer than 100 Bronze customers spending ₹400 each (₹40,000 total), even though Platinum has lower total revenue.

4

Time Trend — Is the Business Growing or Contracting?

Raw revenue numbers without a time axis are nearly useless for decision-making. ₹85,000 this month is good if last month was ₹70,000. It is alarming if last month was ₹120,000. Month-over-month change is the minimum viable trend analysis. Adding a 3-month rolling average smooths seasonal noise so you can distinguish genuine trend from one-off spikes.

The output you want is: monthly revenue, MoM absolute change, MoM percentage change, and a trailing 3-month average. This is the exact format used in weekly business reviews at every data-driven company.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

How to read this output

Read the mom_pct_change column first. Two consecutive months of negative MoM growth is a trend. One negative month is noise. Three or more is a crisis. The rolling_3m_avg should be your headline growth metric — it eliminates month-to-month variance. If rolling_3m_avg is declining while individual months occasionally spike, the business has a volatility problem: a few large orders are masking an underlying decline in regular purchase frequency. Compare the aov column alongside revenue: if revenue is growing but aov is declining, growth is coming from volume (more customers or more orders per customer) — that is healthier than growth from a few large orders.

5

Anomaly Detection — Flag What Needs Investigation

Averages hide problems. A store performing at 40% below the network mean average order value is not just "lower" — it is a signal that something specific is wrong there. The final step of any revenue report is surfacing these anomalies so the operations team knows exactly where to look.

Use standard deviation to define "significantly different from normal". A store whose AOV is more than 1.5 standard deviations below the mean is flagged for investigation. A payment method showing unusually high cancellation rates is flagged for the payments team. These flags convert a passive report into an action list.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

The z-score technique

A z-score measures how many standard deviations a data point is from the mean. Z = 0 means exactly average. Z = -2 means two standard deviations below — in a normal distribution, only 2.5% of values fall this far below. When you see a store with z = -1.8 on AOV, you are not looking at normal variation. You are looking at a structural difference that needs explanation. This is the difference between a report that says "Store X has low revenue" and one that says "Store X is a statistical outlier — its AOV is 1.8 standard deviations below the network average, consistent across 45 orders."

PROJECT 02 / 03

30 min

Customer Lifecycle and Churn Architecture

Senior Analytics Engineer @ FreshCart Growth Team

Business Problem

The Growth team has a problem: they do not know which customers are about to churn until after they already have. By the time a customer has been inactive for 6 months, win-back costs 5x more than retention would have. You need to build a customer health monitoring system that identifies churn signals early — declining order frequency, shrinking basket size, increasing time between purchases — so that the CRM team can act proactively.

Your Deliverable

A customer lifecycle model that: (1) defines churn rigorously, (2) computes behavioral signals that precede churn, (3) scores every customer on risk level, (4) outputs a prioritized intervention list ranked by expected value of successful retention, (5) validates the model by checking whether the signals actually correlate with eventual churn.

SQL Skills Applied

Cohort AnalysisWindow FunctionsLAG()RFM ScoringDate ArithmeticCASE WHENCTEsConditional AggregationStatistical Correlation
1

Define Churn — The Most Important Decision in the Whole Project

Churn is not a universal concept — it depends entirely on the expected purchase frequency of your business. A customer who hasn't ordered in 30 days might be churned for a daily-delivery app. For a quarterly subscription box, 30 days of silence is completely normal. Before writing a single line of analysis, you must define churn for your specific business context.

Start by computing the distribution of inter-purchase intervals (time between consecutive orders for the same customer). The median inter-purchase interval is your baseline. Churn should be defined as: inactive for more than 2x–3x the median inter-purchase interval. This is data-driven churn definition, not an arbitrary threshold.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

How to read this output

The median_gap is your most important output. If median is 14 days, then a customer who has not ordered in 42 days (3x median) is showing a meaningful churn signal. If median is 45 days, then 90 days of silence is the churn threshold. Set your churn definition as: inactive for more than 2x the 75th percentile gap. Using the 75th percentile (not median) accounts for the fact that customers naturally have variable purchasing rhythms — you do not want to flag occasional customers as churned just because they match their own normal pattern.

2

Build the Customer Behavioral Fingerprint

Before you can detect churn, you need a behavioral baseline for each customer — what their "normal" looks like. A customer who normally orders every 7 days missing a 10-day window is more significant than a customer who normally orders every 60 days missing 10 days. The behavioral fingerprint captures: average inter-purchase interval, average order value, trend in order value over time (growing or shrinking basket), and purchase frequency velocity (are they ordering more or less frequently in recent months vs. their history).

This is the data you would store in a "customer features" table in a production system, refreshed nightly, and used as input to a churn prediction model.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

How to read this output

The two trend columns are the early warning signals. value_trend is positive if the customer's recent baskets are larger than their early baskets — a healthy sign. Negative value_trend means the customer is spending less per order over time — a churn precursor. gap_trend is positive if the customer is taking longer between orders recently than they historically did — this is the most important churn signal. A customer whose average gap was 12 days but whose recent 3 orders had a 22-day gap is already showing behavioral change. This is the signal you want to detect 30 days before the customer becomes officially "inactive" by any threshold.

3

RFM Scoring — Classify Every Customer on Three Dimensions

RFM (Recency, Frequency, Monetary) is the industry-standard framework for segmenting customers by economic behavior. The key insight is that all three dimensions together are much more predictive than any one alone. A customer who ordered yesterday (high recency) but has only ever ordered once (low frequency) for ₹50 (low monetary) is very different from a customer who ordered yesterday, orders every week, and spends ₹1,000 each time.

Score each dimension 1–5 (not 1–3 as in the intro module) for finer segmentation. A total score of 13–15 is a Champion. A customer with R=1, F=4, M=4 is "Used to be great, now gone" — a high-priority win-back target. A customer with R=4, F=1, M=4 placed one large order recently — a promising Prospect who needs a second purchase incentive.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

The most valuable cell in the RFM matrix

The "At Risk (was Champion)" segment — R=1 or 2, F=4 or 5, M=4 or 5 — is where your highest win-back ROI lives. These customers have proven they will spend significantly and order frequently. Their recency has dropped, but their historical behavior shows they are capable of high engagement. The expected value of winning one back is (their historical monetary / their historical frequency) × expected remaining orders. This is the exact list the CRM team should be working from for win-back campaigns, with the highest-monetary customers first.

4

Cohort Retention Analysis — Measure Whether Your Retention is Improving

Cohort retention is the single most important metric for product health. It answers: of the customers who first ordered in month X, what percentage are still ordering 1, 2, 3 months later? If your M1 retention (% still ordering one month after first purchase) is improving across cohorts, your product is getting better at retaining customers. If it is declining, you have a worsening retention problem that more acquisition will not fix.

The standard output is a retention matrix — rows are cohorts (acquisition months), columns are periods (M0, M1, M2...). Each cell shows absolute count and percentage of M0 that is still active. You want to see the percentages stable or improving as you read down the rows (newer cohorts).

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

How to read this output

Read across each row (one cohort): M1/M0 is your 1-month retention rate. Industry benchmarks for e-commerce are 20–40% M1 retention. Below 20% means most customers try the service once and never return — a product or experience problem. Then read down each column: if M1_retention_pct is 25% for the oldest cohort and 35% for the newest, retention is improving — your product or onboarding is getting better. If it is declining, acquisition is outpacing retention improvement — you are filling a leaky bucket. The cell that matters most for LTV calculation is M3 retention: customers who order in months 0, 1, 2, and 3 are typically loyal long-term customers — their LTV is 3–5x a one-time purchaser.

5

Build the Prioritized Intervention List

The final output of any churn analysis is actionable: a ranked list of customers who should receive a specific intervention, ordered by the expected value of successful retention. Expected value = (probability of successful win-back × expected future revenue). Since you cannot easily compute probability without a trained model, use a proxy: customers with high historical monetary and frequency who have recently gone silent are both high-value to win back AND historically more likely to respond (they were engaged customers).

Combine all the signals from steps 1–4: days_since_last, gap_trend, rfm score, monetary value. Output a ranked list with a recommended action for each tier of risk.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

The intervals_missed column

This is more meaningful than raw days_inactive. A customer whose average purchase interval is 7 days who has been inactive for 21 days has missed 3 intervals — that is deeply abnormal for them specifically. A customer whose average interval is 90 days inactive for 21 days is perfectly on track. The same 21-day number means completely different things for these two customers. intervals_missed normalizes recency to each customer's own behavioral baseline, which is exactly what a production churn model does.

PROJECT 03 / 03

30 min

Schema Design: Delivery SLA Tracking System

Database Engineer @ FreshCart Infrastructure Team

Business Problem

FreshCart is launching a delivery SLA guarantee: orders within city limits will be delivered within 2 hours of placement, or the customer receives a ₹100 store credit. The current orders table has a delivery_date column but no delivery timestamps, no SLA tracking, no credit issuance records, and no delivery partner assignment. You need to design a complete schema extension that supports: sub-minute delivery tracking, SLA compliance reporting, automatic credit calculation, and performance analytics by delivery partner, store, and time-of-day.

Your Deliverable

A fully normalized schema design with DDL, constraint rationale, index strategy, and five analytical queries that answer the SLA monitoring questions the operations team will ask daily.

SQL Skills Applied

DDLCREATE TABLENormalization (1NF–3NF)CHECK ConstraintsFOREIGN KEY DesignIndex StrategyAnalytical QueriesTemporal Data Design
1

Requirements Analysis — What Questions Must This Schema Answer?

Schema design always starts from queries, not from tables. A schema is good if it efficiently answers the questions the business will ask. Write out every analytical question before touching DDL. If you design tables first, you will discover mid-project that you cannot efficiently answer a critical question and will need to restructure.

The operational questions for this system: What percentage of orders met SLA today / this week / by store? Which delivery partners have the worst SLA compliance? What time-of-day has the highest SLA breach rate? What is the total credit liability issued this month? Which customers received more than two credits (potential policy abuse)? How does weather / order volume affect delivery time?

Why this matters

Every question in that list maps to a JOIN path in the schema. "By delivery partner" means there must be a delivery_partners table and a foreign key linking deliveries to it. "Time-of-day breach rate" means delivery timestamps must be stored with time precision, not just date. "Customer credit count" means a credits table with a customer_id FK. Writing questions first forces you to enumerate all the entities and relationships before you start writing CREATE TABLE — you cannot miss an entity if you derive tables from questions.

2

Entity Identification and Normalization

From the requirements, extract every entity (noun that needs its own facts stored) and every relationship (FK). Entities: delivery_partners (their name, zone, rating), deliveries (one per order, with timestamps), sla_policies (what the SLA is per store zone), sla_breaches (a breach record when SLA is missed), customer_credits (credit issued per breach).

Normalization check: is any non-key attribute dependent on another non-key attribute (transitive dependency — 3NF violation)? The store's SLA zone should not be stored in the deliveries table — that creates update anomalies if the zone changes. It belongs in a stores extension or an sla_policies lookup. Breach amount should not be hard-coded in deliveries — it belongs in sla_policies so it can change without data migration.

Entity-Relationship reasoning before writing DDL
-- ENTITIES and their key attributes:
-- delivery_partners: id, name, phone, home_zone, vehicle_type, active
-- deliveries:        id, order_id (FK), partner_id (FK), policy_id (FK),
--                    dispatched_at, pickup_at, delivered_at, distance_km
-- sla_policies:      id, store_id (FK), zone_name, sla_minutes, breach_credit_amount
-- sla_breaches:      id, delivery_id (FK), expected_by, actual_at, minutes_late
-- customer_credits:  id, customer_id (FK), breach_id (FK), amount, issued_at, redeemed_at

-- RELATIONSHIPS:
-- orders(1) → deliveries(1)           one order has one delivery
-- delivery_partners(1) → deliveries(N) one partner handles many deliveries
-- sla_policies(1) → deliveries(N)      one policy applies to many deliveries
-- deliveries(1) → sla_breaches(0..1)   a delivery may or may not breach SLA
-- sla_breaches(1) → customer_credits(1) each breach generates one credit

-- NORMALIZATION DECISIONS:
-- breach_credit_amount in sla_policies (NOT in sla_breaches) — 3NF
--   if stored in breach, it duplicates the policy and can drift from policy
-- sla_minutes in sla_policies (NOT in deliveries) — 3NF
--   the deadline is derived: dispatched_at + sla_minutes, not stored
-- zone assignment: store → policy (store_id FK in sla_policies)
--   changing a store's zone only updates one policy row, not all deliveries
3

Write the DDL — Every Constraint Has a Reason

Professional DDL does not just create tables — every constraint has a documented reason. NOT NULL because this column is always required to answer a business question. CHECK because this column has a bounded domain (SLA minutes cannot be 0 or negative). UNIQUE because duplicates here would corrupt downstream reports. FOREIGN KEY with ON DELETE RESTRICT (not CASCADE) when the child records are financial — you do not want deleting a delivery partner to cascade-delete their breach records and the customer credits those breaches generated.

Run this in the playground and verify the tables were created correctly by querying sqlite_master.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
4

Index Strategy — Design Indexes From the Query Access Patterns

Indexes are not free — every index slows down writes and consumes storage. Create indexes only where the query access pattern justifies them. For each index, articulate exactly which query it enables and what the cost without it would be. The primary key columns already have indexes in SQLite. You need to add indexes for: every foreign key column (SQLite does not auto-index FK columns, unlike PostgreSQL), and every column that appears in a WHERE clause in the operational queries.

The five operational queries for this system: (1) all deliveries by partner today, (2) breaches in the last 7 days by store, (3) unredeemed credits by customer, (4) SLA compliance rate by hour-of-day, (5) partner performance ranking this month. Map each query to the indexes it needs.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
5

Seed Data and Validate With Analytical Queries

A schema is only proven correct when you can answer the business questions from it. Insert realistic test data, then run all five operational queries. If any query is unexpectedly complex or slow, go back and reconsider the schema — that is schema validation. The complexity of a query is often a signal of a schema design problem.

Insert: 2 delivery partners, 1 SLA policy, 3 deliveries (2 on-time, 1 breached), 1 breach record, 1 credit. Then answer: SLA compliance rate, total credit liability, and which partner had the breach.

Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…
Loading FreshCart DB…
Ctrl + Enter to run
Loading FreshCart database in your browser…

How to read this output

Partner 2 (Priya Singh) shows a 100% breach rate in the seed data — one delivery, one breach. In a real dataset with hundreds of deliveries, this query surfaces exactly which partners consistently underperform SLA. The avg_delivery_minutes column enables a complementary analysis: a partner with 0% breaches but avg_delivery_minutes close to the SLA limit (110 out of 120 minutes) is a latent risk — one traffic incident makes them a breacher. The truly reliable partners show low avg with low breach rate.

6

Schema Retrospective — What Would Break at Scale?

Good engineers design schemas not just for the current load but for 100x the current load. Review the schema and identify: (1) which tables will grow fastest, (2) which queries will become slow as data accumulates, (3) what partitioning or archival strategy will be needed.

In this schema: deliveries and sla_breaches grow with every order — at 10,000 orders/day they accumulate 300,000 rows/month. The operational queries all filter on time columns (dispatched_at, actual_at). Without composite indexes on (partner_id, dispatched_at) and (store_id, actual_at), monthly reports become full table scans. For reporting at scale, a separate analytics table denormalizing delivery + breach + policy into one row per delivery would dramatically speed up aggregations at the cost of some write complexity.

Scale considerations — additional indexes and partitioning strategy
-- At scale: composite indexes for the most common analytical filters
-- "Partner performance this month"
CREATE INDEX idx_deliveries_partner_dispatched
  ON deliveries(partner_id, dispatched_at);

-- "Store SLA compliance this week"
CREATE INDEX idx_deliveries_policy_dispatched
  ON deliveries(policy_id, dispatched_at);

-- "Unredeemed credits this month"
CREATE INDEX idx_credits_redeemed_issued
  ON customer_credits(redeemed_at, issued_at)
  WHERE redeemed_at IS NULL;     -- partial index — only unredeemed rows indexed

-- Archival strategy:
-- After 90 days: move old deliveries to deliveries_archive (same schema)
-- Keep deliveries table to last 90 days for operational queries
-- Analytical queries JOIN both tables via UNION ALL or a view

-- At PostgreSQL scale: partition deliveries by month
-- CREATE TABLE deliveries_2024_03 PARTITION OF deliveries
--   FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
-- SQLite: emulate with separate tables + UNION ALL view

🎯 Key Takeaways

  • Revenue intelligence: always compute revenue_per_customer, not just total revenue. Use z-scores to flag stores deviating significantly from the network mean — that is a statistical anomaly, not noise.
  • Churn definition must be data-driven: compute the inter-purchase interval distribution, then define churn as inactive for 2–3x the P75 gap. Arbitrary thresholds (30 days, 90 days) ignore each customer's actual purchase rhythm.
  • The behavioral fingerprint signals that precede churn: widening gap between recent purchases (gap_trend > 0) and shrinking basket size (value_trend < 0). These appear 2–4 weeks before a customer crosses the official "inactive" threshold.
  • RFM scoring on quintiles (not fixed thresholds) makes the segmentation relative to your actual customer population — it adapts to dataset size and distribution.
  • The "At Risk (was Champion)" RFM cell — high F and M, low R — is the highest expected-value win-back target. These customers have proven they will spend; you just need to reactivate them.
  • Schema design starts from queries, not tables. Write out every analytical question the schema must answer, derive the entities and relationships from those questions, then write DDL.
  • Every constraint has a reason: NOT NULL because the column is required for correctness; CHECK for bounded domains; UNIQUE for business uniqueness rules; ON DELETE RESTRICT (not CASCADE) for financial records.
  • Index only what you need: every index slows writes. Justify each index with the specific query it enables. Composite indexes (partner_id, dispatched_at) support range queries within a partition. Partial indexes (WHERE redeemed_at IS NULL) index only the rows that matter for operational queries.

You have completed all 62 SQL modules

The curriculum is complete — from SELECT basics to production-grade analytical systems. The next step is real data: find a dataset you are genuinely curious about, define a business question, and build the answer end to end. That is the work.

Back to SQL Curriculum
Share

Discussion

0

Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.

Continue with GitHub
Loading...