Removing Duplicates — DISTINCT
Return only unique values, understand how DISTINCT works across single and multiple columns, its performance cost, and when to use GROUP BY instead
// Part 01
The Problem DISTINCT Solves
By default, SELECT returns every row that satisfies your WHERE condition — including duplicates. If ten customers all live in Bangalore, a query for cities returns "Bangalore" ten times. If thirty orders were placed across five stores, a query for store IDs returns thirty rows containing only five distinct IDs — with many repeats.
Sometimes you want those repeats — when you are counting transactions, listing orders, or analysing every individual record. But sometimes you want to know the unique set of values — which cities does FreshMart serve, which categories of products exist, which payment methods have been used. This is what DISTINCT does: it eliminates duplicate rows from your result, returning each unique value exactly once.
Consider SELECT city FROM customers versus SELECT DISTINCT city FROM customers. Same table, same column, one keyword difference — completely different results. The first query answers "what city is each customer in?" The second answers "which cities do our customers come from?"
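Here is a minimal sketch of that one-keyword difference, run in SQLite through Python's sqlite3 module. The customers table and its values are invented for illustration.

```python
import sqlite3

# Hypothetical customers table -- names and cities invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Asha", "Bangalore"), ("Ravi", "Bangalore"),
     ("Meera", "Chennai"), ("Karan", "Bangalore")],
)

# Plain SELECT: one row per customer, so cities repeat.
every_row = [r[0] for r in conn.execute("SELECT city FROM customers")]
print(every_row)   # ['Bangalore', 'Bangalore', 'Chennai', 'Bangalore']

# SELECT DISTINCT: each city exactly once (ORDER BY added for a stable order).
unique = [r[0] for r in conn.execute(
    "SELECT DISTINCT city FROM customers ORDER BY city")]
print(unique)      # ['Bangalore', 'Chennai']
```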
// Part 02
How DISTINCT Works Internally
Understanding what DISTINCT does inside the database helps you predict its performance and know when to use it.
The deduplication process
When the database executes a DISTINCT query, it collects all rows that would normally be returned, then eliminates any row whose combination of column values has already been seen. The result is the unique set of rows. Internally, the database does this using one of two mechanisms:
Sorting: The database sorts all rows by the SELECT columns. Identical rows end up adjacent, making duplicates easy to identify and skip. This is why DISTINCT queries often show sorted output in practice — though this is a side effect of the implementation, not a guarantee.
Hashing: The database computes a hash of each row's values and uses a hash table to track which combinations have been seen. When a row's hash matches an existing entry, it is a duplicate and is discarded. Hashing avoids the sort step and is often faster when the result fits in memory.
Both approaches require processing every row in the result before any rows can be returned — DISTINCT cannot return the first row until it has seen all rows, because the first row might turn out to be a duplicate of the last row. This is why DISTINCT has a cost that grows with the number of input rows and cannot return results early the way LIMIT can.
Where DISTINCT fits in execution order
DISTINCT is applied after SELECT but before ORDER BY. The execution order for a query with DISTINCT is: FROM → WHERE → SELECT (project columns) → DISTINCT (eliminate duplicates) → ORDER BY → LIMIT. This means DISTINCT operates on the projected columns — the set of columns you listed in SELECT — not on all columns in the table.
// Part 03
DISTINCT on a Single Column
The most common use of DISTINCT is finding the unique values in one column — all distinct cities, all distinct categories, all distinct statuses. This is often called finding the domain of a column (the set of values it takes); the number of distinct values is its cardinality.
Single-column DISTINCT queries are the most useful exploratory queries to run when you join a new project. Before you write any filtering queries on these columns, run DISTINCT to know exactly what values exist in the data — including any unexpected values, typos, or formatting inconsistencies that would cause your WHERE conditions to miss rows.
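A sketch of that exploratory workflow, with an invented orders table that deliberately contains a typo and inconsistent casing — exactly the kind of surprise DISTINCT profiling uncovers:

```python
import sqlite3

# Invented status column with a typo ('deliverd') and mixed casing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "delivered"), (2, "Delivered"),
                  (3, "deliverd"), (4, "pending")])

# DISTINCT reveals all four variants, so a filter like
# WHERE status = 'delivered' would silently miss two rows.
statuses = [r[0] for r in conn.execute(
    "SELECT DISTINCT status FROM orders ORDER BY status")]
print(statuses)  # ['Delivered', 'deliverd', 'delivered', 'pending']
```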
// Part 04
DISTINCT on Multiple Columns
When you list multiple columns in a DISTINCT query, the database returns each unique combination of those columns — not just unique values in each column independently. A row is a duplicate only if every column in the SELECT list has an identical value.
The combination rule in practice
Consider DISTINCT city, loyalty_tier on a customers table with 20 rows. There are 7 distinct cities and 4 distinct tiers. The number of distinct combinations is NOT 7 + 4 = 11. It is however many unique city-tier pairs actually appear in the data — some cities might have customers at all four tiers, others might only have Bronze and Silver customers. DISTINCT returns only the combinations that genuinely exist.
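The combination rule can be verified directly. This sketch (invented data, shrunk from the 20-row scenario above) shows that only pairs that genuinely occur are returned:

```python
import sqlite3

# Invented customers table: duplicate (city, tier) pairs should collapse,
# but pairs that never occur should not appear at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (city TEXT, loyalty_tier TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [
    ("Bangalore", "Gold"), ("Bangalore", "Gold"),      # duplicate pair
    ("Bangalore", "Platinum"),
    ("Hyderabad", "Silver"), ("Hyderabad", "Silver"),  # duplicate pair
])

pairs = list(conn.execute(
    "SELECT DISTINCT city, loyalty_tier FROM customers "
    "ORDER BY city, loyalty_tier"))
print(pairs)
# [('Bangalore', 'Gold'), ('Bangalore', 'Platinum'), ('Hyderabad', 'Silver')]
```

Three unique pairs come back — not 2 cities + 3 tiers, and not every possible city-tier combination.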
// Part 05
DISTINCT with WHERE and ORDER BY
DISTINCT works seamlessly with WHERE and ORDER BY. WHERE filters rows before DISTINCT processes them — so DISTINCT only sees and deduplicates the rows that passed the filter. ORDER BY sorts the unique result set after deduplication.
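A small sketch of that pipeline order — filter, then deduplicate, then sort — using an invented orders table:

```python
import sqlite3

# Invented data: which stores have taken at least one order over 100?
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (store_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 50.0), (1, 500.0), (2, 700.0), (3, 20.0), (2, 900.0)])

# WHERE runs first (3 rows survive), DISTINCT deduplicates them (2 stores),
# and ORDER BY sorts the deduplicated result.
stores = [r[0] for r in conn.execute(
    "SELECT DISTINCT store_id FROM orders "
    "WHERE amount > 100 ORDER BY store_id DESC")]
print(stores)  # [2, 1]
```

Store 3 never appears: its only order was filtered out before DISTINCT ever saw it.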
// Part 06
COUNT DISTINCT — Counting Unique Values
One of the most common analytical questions is not "what are the unique values?" but "how many unique values are there?" For this, SQL provides COUNT(DISTINCT column) — it counts the number of distinct non-null values in a column.
COUNT(*) vs COUNT(DISTINCT column) vs COUNT(column)
| Expression | What it counts | NULL handling |
|---|---|---|
| COUNT(*) | Every row, regardless of values | Counts rows even if every column is NULL |
| COUNT(column) | Rows where column is NOT NULL | NULLs are excluded from the count |
| COUNT(DISTINCT column) | Unique non-NULL values in column | NULLs are excluded from the count |
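The three counting expressions can be compared side by side. This sketch uses an invented orders table where delivery_date is NULL for undelivered orders:

```python
import sqlite3

# Invented orders table: delivery_date is NULL for undelivered orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, delivery_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (101, "2024-01-05"), (101, None), (102, "2024-01-06"), (103, None),
])

row = conn.execute("""
    SELECT COUNT(*),                     -- every row
           COUNT(delivery_date),         -- rows where delivery_date IS NOT NULL
           COUNT(DISTINCT customer_id)   -- unique non-NULL customers
    FROM orders
""").fetchone()
print(row)  # (4, 2, 3)
```

Four orders total, two delivered, three distinct customers — three different questions answered from one table.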
// Part 07
DISTINCT vs GROUP BY — When to Use Which
DISTINCT and GROUP BY both return unique combinations of values. For simple deduplication, they produce identical results. But they serve different purposes and have different capabilities.
They diverge when you want to calculate something per unique value. DISTINCT cannot do this — it only removes duplicates. GROUP BY can aggregate: count how many customers per city, sum revenue per store, find the average price per category. You will learn GROUP BY fully in Module 28, but the key distinction is simple: DISTINCT only deduplicates, while GROUP BY deduplicates and lets you compute over each group.
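A sketch of both sides of that distinction, with invented data — for bare deduplication the two forms return identical rows, but only GROUP BY can attach a per-group calculation:

```python
import sqlite3

# Invented customers table shared by both queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Asha", "Bangalore"), ("Ravi", "Bangalore"),
                  ("Meera", "Chennai")])

# For simple deduplication, DISTINCT and GROUP BY give the same result.
distinct_rows = list(conn.execute(
    "SELECT DISTINCT city FROM customers ORDER BY city"))
grouped_rows = list(conn.execute(
    "SELECT city FROM customers GROUP BY city ORDER BY city"))
print(distinct_rows == grouped_rows)  # True

# GROUP BY can go further: a count per city, which DISTINCT cannot express.
counts = list(conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"))
print(counts)  # [('Bangalore', 2), ('Chennai', 1)]
```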
Performance comparison
For simple deduplication, DISTINCT and GROUP BY have similar performance — both require the database to process all rows and identify unique combinations. GROUP BY often has a slight edge because it is more directly optimised in most database engines. For large tables, if you only need unique values with no aggregation, both are valid — but GROUP BY is preferred in professional code because it is more expressive and extensible (you can add COUNT or SUM later without rewriting the query structure).
🎯 Pro Tip
In production code, prefer GROUP BY over DISTINCT for deduplication when working with large tables — it is more explicit about intent, easier to extend with aggregations, and often slightly faster. Use DISTINCT for quick exploration and profiling, and when the query is simple enough that GROUP BY would add unnecessary verbosity.
// Part 08
DISTINCT and Performance — The Hidden Cost
DISTINCT is not free. Before returning any row, the database must process all rows in the result set and eliminate duplicates. On small tables this is imperceptible. On tables with millions of rows, DISTINCT can be significantly slower than a plain SELECT — and significantly slower than a well-designed GROUP BY with an index.
When DISTINCT is expensive
DISTINCT requires a sort or hash of the entire projected result set. If 5 million rows pass through WHERE, DISTINCT must hash or sort all 5 million before returning any. Memory usage grows with result size — large DISTINCT operations may spill to disk, causing further slowdown.
When DISTINCT is cheap
If the column being deduplicated has an index, the database can often use an index scan to find unique values without processing every row. For a column with 7 distinct values in a 10-million-row table, the database can scan just the index (far smaller than the table) and return 7 values almost instantly. For columns with few distinct values relative to total rows (low cardinality — like city, status, category), DISTINCT is fast even on large tables.
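You can see the planner's choice yourself with SQLite's EXPLAIN QUERY PLAN. This is an illustrative sketch with invented data; the exact plan text varies between SQLite versions, so treat the printed plan as informational rather than guaranteed:

```python
import sqlite3

# Invented low-cardinality column: 3 distinct cities across 400 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany("INSERT INTO customers (city) VALUES (?)",
                 [("Bangalore",), ("Chennai",), ("Bangalore",), ("Pune",)] * 100)
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")

# With an index on city, SQLite can answer DISTINCT from the index alone.
for row in conn.execute("EXPLAIN QUERY PLAN SELECT DISTINCT city FROM customers"):
    print(row)  # typically reports a scan using the covering index

cities = [r[0] for r in conn.execute(
    "SELECT DISTINCT city FROM customers ORDER BY city")]
print(cities)  # ['Bangalore', 'Chennai', 'Pune']
```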
DISTINCT as a debugging smell
Experienced SQL writers know that DISTINCT in a complex query — especially a query with JOINs — is often a sign that something else is wrong. If a JOIN is producing more rows than expected (a fan-out from a one-to-many relationship), adding DISTINCT might mask the problem rather than fix it. Before reaching for DISTINCT, ask: why are there duplicates? If the answer is "my JOIN is returning more rows than I expect," fix the JOIN rather than hiding the extra rows with DISTINCT.
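The fan-out and its proper fix can be demonstrated concretely. In this invented one-to-many example, the JOIN repeats each customer once per order; GROUP BY keeps the information that DISTINCT would throw away:

```python
import sqlite3

# Invented one-to-many schema: one customer, many orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.execute("INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi')")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(10, 1), (11, 1), (12, 1), (13, 2)])

# Fan-out: 4 result rows for only 2 customers, because Asha has 3 orders.
fanned = [r[0] for r in conn.execute(
    "SELECT name FROM customers JOIN orders USING (customer_id)")]
print(len(fanned))  # 4

# DISTINCT would collapse this to 2 rows but discard the order counts.
# GROUP BY answers the real question instead.
counts = list(conn.execute("""
    SELECT name, COUNT(order_id)
    FROM customers JOIN orders USING (customer_id)
    GROUP BY customer_id, name
    ORDER BY name
"""))
print(counts)  # [('Asha', 3), ('Ravi', 1)]
```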
// Part 09
Practical DISTINCT Patterns — Real Business Uses
These are the DISTINCT patterns you will write most frequently in real analytics work.
Schema exploration — what values exist in this column?
Reach analysis — which entities touched a segment?
Data quality checks — find unexpected values
Cardinality profiling — how many unique values?
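The four patterns above can be sketched against a single invented products table — the table, product names, and brands are made up for illustration:

```python
import sqlite3

# Invented products table, including a deliberate 'bakery'/'Bakery' casing bug.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, brand TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    ("Milk 1L", "Dairy", "Amul"), ("Butter", "Dairy", "Amul"),
    ("Bread", "Bakery", "Britannia"), ("Cookies", "bakery", "Britannia"),
])

# 1. Schema exploration + 3. data quality check: the casing bug is visible.
categories = [r[0] for r in conn.execute(
    "SELECT DISTINCT category FROM products ORDER BY category")]
print(categories)  # ['Bakery', 'Dairy', 'bakery']

# 2. Reach analysis: which brands appear in the Dairy category?
dairy_brands = [r[0] for r in conn.execute(
    "SELECT DISTINCT brand FROM products WHERE category = 'Dairy'")]
print(dairy_brands)  # ['Amul']

# 4. Cardinality profiling: how many unique brands?
n_brands = conn.execute(
    "SELECT COUNT(DISTINCT brand) FROM products").fetchone()[0]
print(n_brands)  # 2
```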
// Part 10
What This Looks Like at Work
You are a data analyst at Zepto, the quick commerce startup. The product team is preparing a feature that lets customers filter products by brand. Before the engineering team builds the filter UI, they need to know exactly which brands exist in the product catalogue — the complete, deduplicated list with no repeats.
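The query the product team needs is a one-liner. This sketch uses an invented products table and brand names — the real catalogue schema would differ:

```python
import sqlite3

# Invented product catalogue for the brand-filter feature.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, brand TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "Amul"), (2, "Amul"), (3, "Nestle"), (4, "Britannia")])

# The complete, deduplicated brand list, sorted for the filter UI.
brands = [r[0] for r in conn.execute(
    "SELECT DISTINCT brand FROM products ORDER BY brand")]
print(brands)  # ['Amul', 'Britannia', 'Nestle']
```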
🎯 Pro Tip
DISTINCT is one of the most useful tools for data profiling — understanding a new dataset before writing production queries on it. When you join a new project, spend an hour running SELECT DISTINCT on every important column in every important table. You will find unexpected values, inconsistent capitalisation, typos, deprecated statuses, and missing data — all of which affect every query you will write on that column. Discovering these issues before writing business logic saves hours of debugging later.
// Part 11
Interview Prep — 5 Questions With Complete Answers
Question 1: What does SELECT DISTINCT do, how does it differ from a plain SELECT, and how does the database implement it?
SELECT DISTINCT returns only unique rows — it eliminates duplicate rows from the result set before returning them to the caller. A plain SELECT returns every row that satisfies the WHERE condition, including duplicates. The deduplication applies to the complete combination of all columns listed in SELECT: a row is considered a duplicate only if every column in the SELECT list has an identical value to another row.
The practical difference: SELECT city FROM customers returns one row per customer — 20 rows if there are 20 customers, with cities repeated for customers in the same city. SELECT DISTINCT city FROM customers returns one row per unique city — 7 rows if customers are distributed across 7 cities, regardless of how many customers are in each.
Internally, DISTINCT requires the database to process all result rows and eliminate duplicates before returning any. This is done through sorting (duplicates become adjacent) or hashing (track seen values in a hash table). Both approaches require processing the full result set, making DISTINCT a blocking operation — it cannot return partial results early the way LIMIT can. The cost grows with the number of input rows and the number of columns in the SELECT list.
Question 2: Explain the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column).
COUNT(*) counts every row in the result set regardless of the values in any column. It includes rows where all columns are NULL. It is the correct function for counting how many rows a query returns — total orders, total customers, total products.
COUNT(column) counts rows where the specified column is NOT NULL. If delivery_date is NULL for undelivered orders, COUNT(delivery_date) counts only the delivered orders — those where delivery_date has a real value. This makes COUNT(column) useful for counting non-missing values: how many orders have been assigned a delivery date, how many employees have a specified manager.
COUNT(DISTINCT column) counts the number of unique non-null values in the specified column. COUNT(DISTINCT customer_id) from the orders table counts how many distinct customers have placed at least one order — not how many total orders, and not including customers who have never ordered. This is the cardinality question: how many unique values exist. A common analytical pattern combining all three: SELECT COUNT(*) AS total_orders, COUNT(delivery_date) AS delivered_orders, COUNT(DISTINCT customer_id) AS unique_customers FROM orders — each answers a different question about the same table.
Question 3: How does DISTINCT behave when multiple columns are listed in the SELECT?
When multiple columns are listed in a SELECT DISTINCT query, DISTINCT applies to the full combination of all listed columns — not to each column independently. A row is eliminated as a duplicate only if every column in the SELECT list has an identical value to another row. If any one column differs, the row is considered unique and is included in the result.
Concrete example: SELECT DISTINCT city, loyalty_tier FROM customers. The result contains every unique city-tier pair that exists in the data. If Bangalore has customers at Gold and Platinum tiers, two rows appear: (Bangalore, Gold) and (Bangalore, Platinum). If Hyderabad only has Silver customers, one row appears: (Hyderabad, Silver). The total number of result rows is the count of unique combinations — not the sum of distinct values in each column independently.
This combination behaviour is important to understand because it means adding more columns to a DISTINCT query increases the number of rows returned (or keeps it the same — never decreases it). If every combination of city and loyalty_tier is unique, SELECT DISTINCT city, loyalty_tier returns as many rows as SELECT DISTINCT city. Only if multiple rows share the exact same city AND loyalty_tier does DISTINCT reduce the count. The more columns you add, the more specific the combination and the fewer rows get eliminated as duplicates.
Question 4: When should you use DISTINCT, and when should you use GROUP BY?
Both DISTINCT and GROUP BY can return unique combinations of column values — for simple deduplication they produce identical results. The choice depends on whether you need to calculate anything per unique combination.
Use DISTINCT when you only need the unique values themselves with no aggregation: SELECT DISTINCT city FROM customers. It is concise and communicates intent clearly — you want the unique set of cities, nothing more. DISTINCT is also the appropriate choice in COUNT(DISTINCT column) expressions inside aggregate queries.
Use GROUP BY when you need unique values plus any calculation per group: SELECT city, COUNT(*) AS customer_count FROM customers GROUP BY city. DISTINCT cannot perform this — it only eliminates duplicates, it does not aggregate. GROUP BY is also preferred in production code for large tables because it is more directly optimised in most database engines and is more extensible — you can add SUM, AVG, or MAX columns without changing the query structure. A practical rule: if the query only has DISTINCT with no aggregate functions, GROUP BY is an equally valid and often preferable alternative. If you need aggregation per group, GROUP BY is the only option.
Question 5: Why is DISTINCT in a JOIN query often considered a warning sign, and what should you do instead?
DISTINCT in a JOIN query is a warning sign because it often indicates that the JOIN is producing more rows than intended — and DISTINCT is being used to hide the problem rather than fix it. The most common cause is a one-to-many JOIN that fans out rows: if you join customers to orders on customer_id, and a customer has 5 orders, that customer's row appears 5 times in the result. Adding DISTINCT collapses those 5 rows back to 1 — but you have also lost the information that there were 5 orders, and you are paying the cost of both the fan-out and the deduplication.
The correct fix depends on what you actually want. If you want one row per customer with an order count, use GROUP BY: SELECT customer_id, COUNT(order_id) AS order_count FROM customers JOIN orders USING (customer_id) GROUP BY customer_id. If you want customers who have placed at least one order (existence check), use EXISTS: SELECT customer_id FROM customers WHERE EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = customers.customer_id). Both are more correct and more efficient than joining and then applying DISTINCT.
The general principle: when you find yourself adding DISTINCT to remove unexpected duplicates, stop and investigate why the duplicates exist. The answer is almost always a JOIN issue — wrong join column, missing join condition creating a cartesian product, or a one-to-many relationship producing more rows than expected. Fixing the root cause gives you correct results with better performance. DISTINCT on top of a broken JOIN gives you correct-looking results that hide a performance problem and a misunderstood data model.
// Part 12
Errors You Will Hit — And Exactly Why They Happen
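One classic surprise involves NULL. Across SQL engines, DISTINCT treats NULLs as duplicates of each other, so multiple NULLs collapse into a single result row — while COUNT(DISTINCT ...) skips NULLs entirely, so the two can disagree about how many "values" a column has. A sketch with invented data, run in SQLite:

```python
import sqlite3

# Invented referral_code column where missing codes are stored as NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (referral_code TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)",
                 [("FRIEND10",), (None,), (None,), ("FRIEND10",)])

# DISTINCT collapses both NULLs into one row: 2 rows come back, not 3.
rows = list(conn.execute("SELECT DISTINCT referral_code FROM customers"))
print(len(rows))  # 2 -- one NULL row plus one 'FRIEND10' row

# COUNT(DISTINCT ...) ignores NULL entirely: it reports 1, not 2.
count = conn.execute(
    "SELECT COUNT(DISTINCT referral_code) FROM customers").fetchone()[0]
print(count)  # 1
```

If a report needs NULL counted as its own category, wrap the column in COALESCE with a sentinel value before deduplicating or counting.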
🎯 Key Takeaways
- ✓SELECT DISTINCT eliminates duplicate rows from the result. Each unique combination of the listed columns appears exactly once.
- ✓DISTINCT applies to the full combination of ALL columns in SELECT — not to individual columns independently. Adding more columns to SELECT DISTINCT increases or maintains the result count, never decreases it.
- ✓DISTINCT requires processing all result rows before returning any. It cannot return partial results early. On large tables this is expensive — always check whether an index covers the DISTINCT columns.
- ✓COUNT(DISTINCT column) counts unique non-null values in a column. COUNT(*) counts all rows. COUNT(column) counts non-null values. All three answer different questions.
- ✓DISTINCT and GROUP BY produce the same result for simple deduplication. Use DISTINCT for quick deduplication with no aggregation. Use GROUP BY when you need unique values plus any calculation (COUNT, SUM, AVG) per group.
- ✓DISTINCT in a JOIN query is a warning sign — it usually means the JOIN is producing unexpected duplicates. Investigate and fix the JOIN rather than hiding duplicates with DISTINCT.
- ✓The most valuable use of DISTINCT in professional work is schema exploration: run SELECT DISTINCT on key columns in a new table to discover all existing values, including unexpected typos, casing inconsistencies, and deprecated statuses.
- ✓Low-cardinality columns (few distinct values like status, category, tier) are cheap to DISTINCT on. High-cardinality columns (many distinct values like email, order_id) are expensive without an index.
- ✓DISTINCT treats NULLs as duplicates of one another — multiple NULLs in a column collapse into a single NULL row in the result. COUNT(DISTINCT column), by contrast, excludes NULLs entirely; if you want NULL counted as a value, use COALESCE to replace it with a sentinel value first.
- ✓Before writing any WHERE condition that filters on a column you are unfamiliar with, run SELECT DISTINCT column FROM table first — this reveals all actual values and prevents WHERE conditions that silently match nothing due to unexpected formatting.
What comes next
In Module 11, you master NULL values completely — what NULL means, why it behaves differently from every other value, how it propagates through calculations and comparisons, and every technique for handling it correctly in your queries.
Module 11 → Working with NULL Values