3 Real SQL Projects
Three production-grade analytical projects — revenue intelligence, customer lifecycle analysis, and schema design — built step by step with business context and interpretation
// Part 00
How to Work Through These Projects
These are not exercises with a predetermined answer at the bottom. They are structured the same way a senior data analyst actually works: receive a business problem, decompose it into analytical questions, write queries that answer each question, interpret the output, and form a recommendation. Each project has five to seven steps — skip none of them, because the reasoning in each step directly informs the next.
Every project uses the FreshCart dataset already loaded in each playground. The data is real and consistent — every query you run will produce results you can actually interpret. By the end of each project, you should be able to write two or three sentences summarizing what you found and what the business should do about it. That synthesis is the point.
Establish the North Star: Baseline KPIs
Before any breakdown, you need one row that anchors everything else. Every percentage in the rest of the report is relative to these totals. If you skip this step, numbers in later slides have no context — is ₹85,000 good or bad? You cannot know without the total.
Write a single query returning: total delivered orders, total revenue, average order value, count of unique customers who ordered, total items sold, and revenue per customer. These are your KPIs.
How to read this output
Note the cancellation_rate_pct alongside revenue. A high cancellation rate with high total revenue can be misleading — you are measuring orders that completed, not orders that were placed. If cancellation_rate is above 10%, that is worth calling out in the executive summary as a separate concern. Revenue_per_customer is the most important single number here: it tells you the average economic value of acquiring one FreshCart customer.
Decompose Revenue by Store — Find the Leaders and Laggards
Revenue concentration by store reveals two things: which stores are driving growth (should receive more investment, inventory priority) and which stores are underperforming relative to their expected contribution (need investigation — staffing problem? Catchment area? Logistics?). You need absolute revenue, order count, average order value, and share of total.
The critical addition: compute each store's deviation from the mean average order value. A store with high revenue but low AOV is processing many small orders. A store with low order count but high AOV has fewer but more valuable customers. These are very different operational profiles requiring different interventions.
How to read this output
The aov_vs_mean column is the most actionable number on this slide. A store that is -₹150 below the mean AOV is not serving high-value customers — its growth lever is increasing basket size (bundle promotions, minimum order discounts) rather than acquiring more customers. A store that is +₹200 above mean AOV already has high-value customers — its lever is increasing visit frequency. The revenue_share_pct column immediately surfaces concentration risk: if two stores account for 60%+ of revenue, a logistics disruption at either location is a business-level risk.
Customer Segment Analysis — Where Revenue Actually Comes From
The Pareto principle in e-commerce: roughly 20% of customers generate 80% of revenue. Your job is to quantify this precisely, identify which loyalty tiers hold that 20%, and compute the economic gap between tiers — which tells you the value of successfully upgrading a Silver customer to Gold.
The metric that matters most here is revenue per customer within each tier, not total revenue. Total revenue is a function of how many customers are in each tier. Revenue per customer tells you the economic profile of one customer in each segment — that is what drives upgrade and retention strategy.
How to read this output
The revenue_concentration_delta is the key insight column. A positive delta means that tier generates a disproportionately large share of revenue relative to its size — these are your most valuable customers per head. A negative delta means the tier has many customers but each generates relatively little revenue. In most datasets, Platinum and Gold tiers show positive deltas (10–30+ points above their customer share), while Bronze shows a large negative delta. This quantifies exactly how much revenue concentration exists. The upgrade value calculation: if a Silver customer generates ₹X per period and a Gold customer generates ₹Y, each successful tier upgrade is worth ₹(Y−X) per period to the business.
Common mistake
A common mistake here is to look only at total_revenue per tier and conclude that Bronze generates the least value. Bronze often has the most customers — total revenue can be misleading. Always look at revenue_per_customer. A tier with 5 Platinum customers each spending ₹5,000 (₹25,000 total) is more valuable per customer than 100 Bronze customers spending ₹400 each (₹40,000 total), even though Platinum has lower total revenue.
Time Trend — Is the Business Growing or Contracting?
Raw revenue numbers without a time axis are nearly useless for decision-making. ₹85,000 this month is good if last month was ₹70,000. It is alarming if last month was ₹120,000. Month-over-month change is the minimum viable trend analysis. Adding a 3-month rolling average smooths seasonal noise so you can distinguish genuine trend from one-off spikes.
The output you want is: monthly revenue, MoM absolute change, MoM percentage change, and a trailing 3-month average. This is the exact format used in weekly business reviews at every data-driven company.
How to read this output
Read the mom_pct_change column first. Two consecutive months of negative MoM growth is a trend. One negative month is noise. Three or more is a crisis. The rolling_3m_avg should be your headline growth metric — it eliminates month-to-month variance. If rolling_3m_avg is declining while individual months occasionally spike, the business has a volatility problem: a few large orders are masking an underlying decline in regular purchase frequency. Compare the aov column alongside revenue: if revenue is growing but aov is declining, growth is coming from volume (more customers or more orders per customer) — that is healthier than growth from a few large orders.
Anomaly Detection — Flag What Needs Investigation
Averages hide problems. A store performing at 40% below the network mean average order value is not just "lower" — it is a signal that something specific is wrong there. The final step of any revenue report is surfacing these anomalies so the operations team knows exactly where to look.
Use standard deviation to define "significantly different from normal". A store whose AOV is more than 1.5 standard deviations below the mean is flagged for investigation. A payment method showing unusually high cancellation rates is flagged for the payments team. These flags convert a passive report into an action list.
The z-score technique
A z-score measures how many standard deviations a data point is from the mean. Z = 0 means exactly average. Z = -2 means two standard deviations below — in a normal distribution, only 2.5% of values fall this far below. When you see a store with z = -1.8 on AOV, you are not looking at normal variation. You are looking at a structural difference that needs explanation. This is the difference between a report that says "Store X has low revenue" and one that says "Store X is a statistical outlier — its AOV is 1.8 standard deviations below the network average, consistent across 45 orders."
Define Churn — The Most Important Decision in the Whole Project
Churn is not a universal concept — it depends entirely on the expected purchase frequency of your business. A customer who hasn't ordered in 30 days might be churned for a daily-delivery app. For a quarterly subscription box, 30 days of silence is completely normal. Before writing a single line of analysis, you must define churn for your specific business context.
Start by computing the distribution of inter-purchase intervals (time between consecutive orders for the same customer). The median inter-purchase interval is your baseline. Churn should be defined as: inactive for more than 2x–3x the median inter-purchase interval. This is data-driven churn definition, not an arbitrary threshold.
How to read this output
The median_gap is your most important output. If median is 14 days, then a customer who has not ordered in 42 days (3x median) is showing a meaningful churn signal. If median is 45 days, then 90 days of silence is the churn threshold. Set your churn definition as: inactive for more than 2x the 75th percentile gap. Using the 75th percentile (not median) accounts for the fact that customers naturally have variable purchasing rhythms — you do not want to flag occasional customers as churned just because they match their own normal pattern.
Build the Customer Behavioral Fingerprint
Before you can detect churn, you need a behavioral baseline for each customer — what their "normal" looks like. A customer who normally orders every 7 days missing a 10-day window is more significant than a customer who normally orders every 60 days missing 10 days. The behavioral fingerprint captures: average inter-purchase interval, average order value, trend in order value over time (growing or shrinking basket), and purchase frequency velocity (are they ordering more or less frequently in recent months vs. their history).
This is the data you would store in a "customer features" table in a production system, refreshed nightly, and used as input to a churn prediction model.
How to read this output
The two trend columns are the early warning signals. value_trend is positive if the customer's recent baskets are larger than their early baskets — a healthy sign. Negative value_trend means the customer is spending less per order over time — a churn precursor. gap_trend is positive if the customer is taking longer between orders recently than they historically did — this is the most important churn signal. A customer whose average gap was 12 days but whose recent 3 orders had a 22-day gap is already showing behavioral change. This is the signal you want to detect 30 days before the customer becomes officially "inactive" by any threshold.
RFM Scoring — Classify Every Customer on Three Dimensions
RFM (Recency, Frequency, Monetary) is the industry-standard framework for segmenting customers by economic behavior. The key insight is that all three dimensions together are much more predictive than any one alone. A customer who ordered yesterday (high recency) but has only ever ordered once (low frequency) for ₹50 (low monetary) is very different from a customer who ordered yesterday, orders every week, and spends ₹1,000 each time.
Score each dimension 1–5 (not 1–3 as in the intro module) for finer segmentation. A total score of 13–15 is a Champion. A customer with R=1, F=4, M=4 is "Used to be great, now gone" — a high-priority win-back target. A customer with R=4, F=1, M=4 placed one large order recently — a promising Prospect who needs a second purchase incentive.
The most valuable cell in the RFM matrix
The "At Risk (was Champion)" segment — R=1 or 2, F=4 or 5, M=4 or 5 — is where your highest win-back ROI lives. These customers have proven they will spend significantly and order frequently. Their recency has dropped, but their historical behavior shows they are capable of high engagement. The expected value of winning one back is (their historical monetary / their historical frequency) × expected remaining orders. This is the exact list the CRM team should be working from for win-back campaigns, with the highest-monetary customers first.
Cohort Retention Analysis — Measure Whether Your Retention is Improving
Cohort retention is the single most important metric for product health. It answers: of the customers who first ordered in month X, what percentage are still ordering 1, 2, 3 months later? If your M1 retention (% still ordering one month after first purchase) is improving across cohorts, your product is getting better at retaining customers. If it is declining, you have a worsening retention problem that more acquisition will not fix.
The standard output is a retention matrix — rows are cohorts (acquisition months), columns are periods (M0, M1, M2...). Each cell shows absolute count and percentage of M0 that is still active. You want to see the percentages stable or improving as you read down the rows (newer cohorts).
How to read this output
Read across each row (one cohort): M1/M0 is your 1-month retention rate. Industry benchmarks for e-commerce are 20–40% M1 retention. Below 20% means most customers try the service once and never return — a product or experience problem. Then read down each column: if M1_retention_pct is 25% for the oldest cohort and 35% for the newest, retention is improving — your product or onboarding is getting better. If it is declining, acquisition is outpacing retention improvement — you are filling a leaky bucket. The cell that matters most for LTV calculation is M3 retention: customers who order in months 0, 1, 2, and 3 are typically loyal long-term customers — their LTV is 3–5x a one-time purchaser.
Build the Prioritized Intervention List
The final output of any churn analysis is actionable: a ranked list of customers who should receive a specific intervention, ordered by the expected value of successful retention. Expected value = (probability of successful win-back × expected future revenue). Since you cannot easily compute probability without a trained model, use a proxy: customers with high historical monetary and frequency who have recently gone silent are both high-value to win back AND historically more likely to respond (they were engaged customers).
Combine all the signals from steps 1–4: days_since_last, gap_trend, rfm score, monetary value. Output a ranked list with a recommended action for each tier of risk.
The intervals_missed column
This is more meaningful than raw days_inactive. A customer whose average purchase interval is 7 days who has been inactive for 21 days has missed 3 intervals — that is deeply abnormal for them specifically. A customer whose average interval is 90 days inactive for 21 days is perfectly on track. The same 21-day number means completely different things for these two customers. intervals_missed normalizes recency to each customer's own behavioral baseline, which is exactly what a production churn model does.
Requirements Analysis — What Questions Must This Schema Answer?
Schema design always starts from queries, not from tables. A schema is good if it efficiently answers the questions the business will ask. Write out every analytical question before touching DDL. If you design tables first, you will discover mid-project that you cannot efficiently answer a critical question and will need to restructure.
The operational questions for this system: What percentage of orders met SLA today / this week / by store? Which delivery partners have the worst SLA compliance? What time-of-day has the highest SLA breach rate? What is the total credit liability issued this month? Which customers received more than two credits (potential policy abuse)? How does weather / order volume affect delivery time?
Why this matters
Every question in that list maps to a JOIN path in the schema. "By delivery partner" means there must be a delivery_partners table and a foreign key linking deliveries to it. "Time-of-day breach rate" means delivery timestamps must be stored with time precision, not just date. "Customer credit count" means a credits table with a customer_id FK. Writing questions first forces you to enumerate all the entities and relationships before you start writing CREATE TABLE — you cannot miss an entity if you derive tables from questions.
Entity Identification and Normalization
From the requirements, extract every entity (noun that needs its own facts stored) and every relationship (FK). Entities: delivery_partners (their name, zone, rating), deliveries (one per order, with timestamps), sla_policies (what the SLA is per store zone), sla_breaches (a breach record when SLA is missed), customer_credits (credit issued per breach).
Normalization check: is any non-key attribute dependent on another non-key attribute (transitive dependency — 3NF violation)? The store's SLA zone should not be stored in the deliveries table — that creates update anomalies if the zone changes. It belongs in a stores extension or an sla_policies lookup. Breach amount should not be hard-coded in deliveries — it belongs in sla_policies so it can change without data migration.
Write the DDL — Every Constraint Has a Reason
Professional DDL does not just create tables — every constraint has a documented reason. NOT NULL because this column is always required to answer a business question. CHECK because this column has a bounded domain (SLA minutes cannot be 0 or negative). UNIQUE because duplicates here would corrupt downstream reports. FOREIGN KEY with ON DELETE RESTRICT (not CASCADE) when the child records are financial — you do not want deleting a delivery partner to cascade-delete their breach records and the customer credits those breaches generated.
Run this in the playground and verify the tables were created correctly by querying sqlite_master.
Index Strategy — Design Indexes From the Query Access Patterns
Indexes are not free — every index slows down writes and consumes storage. Create indexes only where the query access pattern justifies them. For each index, articulate exactly which query it enables and what the cost without it would be. The primary key columns already have indexes in SQLite. You need to add indexes for: every foreign key column (SQLite does not auto-index FK columns, unlike PostgreSQL), and every column that appears in a WHERE clause in the operational queries.
The five operational queries for this system: (1) all deliveries by partner today, (2) breaches in the last 7 days by store, (3) unredeemed credits by customer, (4) SLA compliance rate by hour-of-day, (5) partner performance ranking this month. Map each query to the indexes it needs.
Seed Data and Validate With Analytical Queries
A schema is only proven correct when you can answer the business questions from it. Insert realistic test data, then run all five operational queries. If any query is unexpectedly complex or slow, go back and reconsider the schema — that is schema validation. The complexity of a query is often a signal of a schema design problem.
Insert: 2 delivery partners, 1 SLA policy, 3 deliveries (2 on-time, 1 breached), 1 breach record, 1 credit. Then answer: SLA compliance rate, total credit liability, and which partner had the breach.
How to read this output
Partner 2 (Priya Singh) shows a 100% breach rate in the seed data — one delivery, one breach. In a real dataset with hundreds of deliveries, this query surfaces exactly which partners consistently underperform SLA. The avg_delivery_minutes column enables a complementary analysis: a partner with 0% breaches but avg_delivery_minutes close to the SLA limit (110 out of 120 minutes) is a latent risk — one traffic incident makes them a breacher. The truly reliable partners show low avg with low breach rate.
Schema Retrospective — What Would Break at Scale?
Good engineers design schemas not just for the current load but for 100x the current load. Review the schema and identify: (1) which tables will grow fastest, (2) which queries will become slow as data accumulates, (3) what partitioning or archival strategy will be needed.
In this schema: deliveries and sla_breaches grow with every order — at 10,000 orders/day they accumulate 300,000 rows/month. The operational queries all filter on time columns (dispatched_at, actual_at). Without composite indexes on (partner_id, dispatched_at) and (store_id, actual_at), monthly reports become full table scans. For reporting at scale, a separate analytics table denormalizing delivery + breach + policy into one row per delivery would dramatically speed up aggregations at the cost of some write complexity.
🎯 Key Takeaways
- ✓Revenue intelligence: always compute revenue_per_customer, not just total revenue. Use z-scores to flag stores deviating significantly from the network mean — that is a statistical anomaly, not noise.
- ✓Churn definition must be data-driven: compute the inter-purchase interval distribution, then define churn as inactive for 2–3x the P75 gap. Arbitrary thresholds (30 days, 90 days) ignore each customer's actual purchase rhythm.
- ✓The behavioral fingerprint signals that precede churn: widening gap between recent purchases (gap_trend > 0) and shrinking basket size (value_trend < 0). These appear 2–4 weeks before a customer crosses the official "inactive" threshold.
- ✓RFM scoring on quintiles (not fixed thresholds) makes the segmentation relative to your actual customer population — it adapts to dataset size and distribution.
- ✓The "At Risk (was Champion)" RFM cell — high F and M, low R — is the highest expected-value win-back target. These customers have proven they will spend; you just need to reactivate them.
- ✓Schema design starts from queries, not tables. Write out every analytical question the schema must answer, derive the entities and relationships from those questions, then write DDL.
- ✓Every constraint has a reason: NOT NULL because the column is required for correctness; CHECK for bounded domains; UNIQUE for business uniqueness rules; ON DELETE RESTRICT (not CASCADE) for financial records.
- ✓Index only what you need: every index slows writes. Justify each index with the specific query it enables. Composite indexes (partner_id, dispatched_at) support range queries within a partition. Partial indexes (WHERE redeemed_at IS NULL) index only the rows that matter for operational queries.
You have completed all 62 SQL modules
The curriculum is complete — from SELECT basics to production-grade analytical systems. The next step is real data: find a dataset you are genuinely curious about, define a business question, and build the answer end to end. That is the work.
Back to SQL CurriculumDiscussion
0Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.