Query Best Practices
Write clean, fast, readable SQL your team will thank you for — SARGability, avoiding SELECT *, NULL handling, and formatting conventions
// Part 01
Why Query Style Matters
Bad SQL is not just ugly — it is slow, hard to debug, and breaks in unexpected ways when the dataset grows. Most of the worst-performing queries in production share a small set of anti-patterns: functions applied to indexed columns, vague column selection, missing aliases, and ambiguous NULLs. Learning to recognize and avoid these patterns will make your queries 10x faster and your code 10x easier to review.
This module covers the patterns senior engineers check for in every SQL code review. They are not stylistic preferences — most of them have direct, measurable performance implications.
// Part 02
Never Use SELECT *
SELECT * seems convenient but causes real problems in production: it reads every column from disk even if you only need two, breaks views and application code when table schemas change (new column added = different column order), and makes queries self-documenting only if you know the table schema by heart.
The only acceptable use of SELECT * is in quick exploratory queries directly in a SQL client — never in application code, views, or stored procedures. In CTEs, SELECT * from an inner CTE into a final SELECT is acceptable if you immediately alias what you need.
// Part 03
SARGability — Write Index-Friendly WHERE Clauses
SARGable stands for Search ARGument Able — a WHERE clause the database can use an index to satisfy. A non-SARGable condition forces a full table scan even when a perfect index exists. The single most common cause: applying a function to the indexed column.
The rule: never transform the indexed column
An index on a column stores the raw column values in sorted order. If your WHERE clause applies a function to the column, the database cannot use the index — the computed value is not in the index. Move transformations to the comparison value instead.
// Part 04
Avoid Functions on Indexed Columns in JOINs
The same SARGability rule applies to JOIN conditions. If a JOIN applies a function to either side's join key, the index on that key cannot be used — the database must compute the function for every row and compare, resulting in a nested-loop full scan.
Pro Tip
Run EXPLAIN QUERY PLAN before your query to see whether the database is using an index. If you see "SCAN TABLE" instead of "SEARCH TABLE USING INDEX", your WHERE or JOIN condition is likely non-SARGable. Fix the condition and re-run EXPLAIN QUERY PLAN to confirm the index is now being used.
// Part 05
Always Alias Tables in Multi-Table Queries
Unaliased multi-table queries are hard to read and can be ambiguous when two tables share a column name. Always assign short, meaningful aliases to every table in a query with more than one table. Use the table's initials or a short mnemonic — not t1, t2, t3.
// Part 06
Use EXISTS Instead of COUNT(*) for Existence Checks
A common pattern: checking whether any row matches a condition, then branching on the result. The slow way is COUNT(*) > 0 — it scans all matching rows and counts them before you can compare. The correct way is EXISTS — it stops as soon as the first matching row is found.
// Part 07
NULL Handling — Three Rules You Must Know
NULL is not a value — it is the absence of a value. It obeys different rules than regular values. Getting NULL handling wrong produces silent incorrect results: rows are silently excluded from aggregations, comparisons return the wrong result, and joins drop rows. Three rules cover 90% of NULL bugs.
Rule 1: NULL comparisons always return NULL (not TRUE or FALSE)
Rule 2: Aggregates silently ignore NULL
Rule 3: Use COALESCE or IFNULL to provide a default
// Part 08
Formatting and Readability Conventions
SQL has no official style guide, but teams that adopt consistent conventions find code review, debugging, and onboarding dramatically easier. These are the conventions used by most professional SQL teams:
Uppercase keywords
SELECT, FROM, WHERE, JOIN, GROUP BY, HAVING, ORDER BY, LIMIT
One SELECT column per line
Easier to comment out, reorder, or add a column during debugging
Always use AS for aliases
SELECT total_amount AS revenue — never SELECT total_amount revenue
Align JOIN ON conditions
Makes foreign key relationships visually obvious
Name CTEs semantically
delivered_orders not cte1 — the name documents the purpose
Trailing commas vs leading commas
Team choice — leading commas make it easier to comment out the last column
// Part 09
Common Anti-Patterns to Eliminate
Anti-pattern 1: Implicit GROUP BY
Anti-pattern 2: Correlated subquery in SELECT (N+1)
Anti-pattern 3: DISTINCT as a bug-hider
Anti-pattern 4: OR across different columns in WHERE
// Part 10
Interview Prep — 5 Questions With Complete Answers
SARGable (Search ARGument Able) describes a WHERE clause condition that allows the database engine to use an index to satisfy it. A condition is SARGable when the indexed column appears alone on one side of the comparison, without any function or transformation applied to it. The database can walk the sorted index to find matching rows rather than scanning the full table.
A condition becomes non-SARGable when a function is applied to the indexed column: WHERE YEAR(order_date) = 2024, WHERE UPPER(city) = 'MUMBAI', or WHERE total_amount * 1.18 > 1000. The database cannot use the index on order_date for the first example because the index stores raw dates, not YEAR-extracted integers. The engine must compute YEAR() for every row and compare — a full table scan.
Fix: transform the comparison value instead of the column. Replace WHERE YEAR(order_date) = 2024 with WHERE order_date > '2024-01-01' AND order_date < '2025-01-01'. The column appears unmodified, the index can be used. SARGability matters because on tables with millions of rows, the difference between an index seek and a full scan is measured in seconds vs. milliseconds. Always check EXPLAIN QUERY PLAN to verify index usage.
SELECT * has three production problems. First, it reads every column from disk even if your application only uses two — unnecessary I/O, more data transferred over the network, more memory used. On wide tables (100+ columns, JSONB blobs, large TEXT fields) this overhead is significant. Second, SELECT * is fragile: when a column is added to the table, SELECT * returns it even if the application does not expect it, breaking deserialization code, ORMs that map columns to object fields, or CSV exports with fixed column order. Third, it makes queries unreadable — a reader cannot determine what data the query produces without knowing the table schema.
The only acceptable context for SELECT * is interactive exploration in a SQL client — never in views, stored procedures, application code, or automated reports. In CTEs, SELECT * from a named CTE into a final outer SELECT is acceptable if the final SELECT explicitly names what it needs. Rule of thumb: if the query will run more than once, name the columns.
WHERE filters rows before grouping. HAVING filters groups after grouping. They operate at different stages of query execution. WHERE runs first — it restricts which rows are included in the aggregation. HAVING runs after GROUP BY — it restricts which groups appear in the final result based on aggregated values.
WHERE can only reference columns that exist in the raw table (individual row values). It cannot reference aggregate functions (SUM, COUNT, AVG) because the aggregation has not happened yet. HAVING can reference both the GROUP BY columns and aggregate functions applied to the group.
Performance implication: filter as early as possible using WHERE. If you can eliminate rows before grouping, the aggregation processes fewer rows and runs faster. HAVING should only contain conditions that genuinely require the aggregated value — anything else belongs in WHERE. Example: WHERE order_status = 'Delivered' eliminates non-delivered rows before counting, which is far more efficient than HAVING order_status = 'Delivered' which would group all rows first, then discard non-delivered groups.
CTEs (WITH clauses) are best for: breaking a complex query into readable named steps, referencing the same intermediate result multiple times within one query, and recursive queries. Most databases optimize CTEs inline (treating them as subqueries), so there is typically no performance difference between a CTE and an equivalent subquery — choose CTEs for readability. Use meaningful CTE names that describe what the set represents.
Subqueries are best for: simple one-off filtering (EXISTS / IN subqueries), scalar subqueries that return a single value, and cases where the subquery is used in only one place and is simple enough to read inline. Correlated subqueries run once per outer row — avoid them in SELECT lists when the same result can be achieved with a JOIN (the N+1 problem).
Temporary tables are best for: large intermediate results that need to be queried multiple times (CTEs re-materialize each time in some databases), results that need indexes for subsequent queries (you can CREATE INDEX on a temp table), and multi-step data transformation pipelines where each step produces a significant intermediate dataset. Temp tables have setup overhead; for small datasets, a CTE is faster and requires no cleanup.
The N+1 problem occurs when application code executes 1 query to get N rows, then executes 1 additional query per row to fetch related data — N+1 queries total. Example: fetch 100 orders (1 query), then for each order, fetch the customer name (100 queries) = 101 queries. At 100 orders this is annoying; at 10,000 orders it is catastrophic.
In raw SQL, the N+1 pattern appears as a correlated subquery in the SELECT list: SELECT order_id, (SELECT first_name FROM customers WHERE customer_id = o.customer_id) FROM orders o. This executes one subquery per order row.
Fix: JOIN the related table in the original query. SELECT o.order_id, c.first_name FROM orders o JOIN customers c ON c.customer_id = o.customer_id. The database executes one query with one pass over both tables — O(1) queries regardless of N. In ORM frameworks (Django, Rails, SQLAlchemy), the equivalent fix is eager loading: select_related() in Django, includes() in Rails, joinedload() in SQLAlchemy. The fix is always the same: fetch all needed data in one query, not one query per row.
🎯 Key Takeaways
- ✓Never use SELECT * in production queries — name every column. Explicit columns prevent silent breaks when schema changes and eliminate unnecessary I/O.
- ✓SARGable WHERE clauses allow index use. Never apply functions to indexed columns: WHERE YEAR(col) = 2024 kills the index; WHERE col >= '2024-01-01' uses it.
- ✓Trailing LIKE wildcards ('Amul%') are SARGable and use indexes. Leading wildcards ('%milk%') force full scans. Use full-text search for contains-matching.
- ✓Use EXISTS instead of COUNT(*) > 0 for existence checks. EXISTS stops at the first match; COUNT(*) scans all matching rows.
- ✓NULL is not a value — it is unknown. Use IS NULL / IS NOT NULL (not = NULL). Aggregates silently skip NULLs. Use COALESCE to provide defaults.
- ✓Filter with WHERE before grouping; filter groups with HAVING after grouping. Move conditions to WHERE whenever they do not require aggregate values.
- ✓Correlated subqueries in SELECT are N+1 queries — one per outer row. Replace with a JOIN. The database does one pass regardless of N.
- ✓Avoid DISTINCT as a bug-hider for duplicate-producing JOINs. Investigate why duplicates appear; usually EXISTS or a better JOIN is the fix.
- ✓Format SQL with uppercase keywords, one column per line, semantic table aliases, and named CTEs that describe what the set represents.
What comes next
In Module 60, you apply everything to real data analysis — cohort analysis, RFM segmentation, and revenue breakdowns using the full FreshCart dataset.
Module 60 → SQL for Data AnalysisDiscussion
0Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.