Removing Duplicates — DISTINCT
Return only unique values, understand how DISTINCT works across single and multiple columns, its performance cost, and when to use GROUP BY instead
// Part 01
The Problem DISTINCT Solves
By default, SELECT returns every row that satisfies your WHERE condition — including duplicates. If ten customers all live in Bangalore, a query for cities returns "Bangalore" ten times. If thirty orders were placed across five stores, a query for store IDs returns thirty rows containing only five distinct IDs — with many repeats.
Sometimes you want those repeats — when you are counting transactions, listing orders, or analysing every individual record. But sometimes you want to know the unique set of values — which cities does FreshMart serve, which categories of products exist, which payment methods have been used. This is what DISTINCT does: it eliminates duplicate rows from your result, returning each unique value exactly once.
Consider SELECT city FROM customers versus SELECT DISTINCT city FROM customers. Same table, same column, one keyword difference — completely different results. The first query answers "what city is each customer in?" The second answers "which cities do our customers come from?"
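Here is a minimal sketch of that one-keyword difference, run in SQLite through Python's sqlite3 module. The customers table and its values are invented for illustration.

```python
import sqlite3

# Hypothetical customers table -- names and cities invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Asha", "Bangalore"), ("Ravi", "Bangalore"),
     ("Meera", "Chennai"), ("Karan", "Bangalore")],
)

# Plain SELECT: one row per customer, so cities repeat.
every_row = [r[0] for r in conn.execute("SELECT city FROM customers")]
print(every_row)   # ['Bangalore', 'Bangalore', 'Chennai', 'Bangalore']

# SELECT DISTINCT: each city exactly once (ORDER BY added for a stable order).
unique = [r[0] for r in conn.execute(
    "SELECT DISTINCT city FROM customers ORDER BY city")]
print(unique)      # ['Bangalore', 'Chennai']
```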
// Part 02
How DISTINCT Works Internally
Understanding what DISTINCT does inside the database helps you predict its performance and know when to use it.
The deduplication process
When the database executes a DISTINCT query, it collects all rows that would normally be returned, then eliminates any row whose combination of column values has already been seen. The result is the unique set of rows. Internally, the database does this using one of two mechanisms:
Sorting: The database sorts all rows by the SELECT columns. Identical rows end up adjacent, making duplicates easy to identify and skip. This is why DISTINCT queries often show sorted output in practice — though this is a side effect of the implementation, not a guarantee.
Hashing: The database computes a hash of each row's values and uses a hash table to track which combinations have been seen. When a row's hash matches an existing entry, it is a duplicate and is discarded. Hashing avoids the sort step and is often faster when the result fits in memory.
Both approaches require processing every row in the result before any rows can be returned — DISTINCT cannot return the first row until it has seen all rows, because the first row might turn out to be a duplicate of the last row. This is why DISTINCT has a cost that grows with the number of input rows and cannot return results early the way LIMIT can.
Where DISTINCT fits in execution order
DISTINCT is applied after SELECT but before ORDER BY. The execution order for a query with DISTINCT is: FROM → WHERE → SELECT (project columns) → DISTINCT (eliminate duplicates) → ORDER BY → LIMIT. This means DISTINCT operates on the projected columns — the set of columns you listed in SELECT — not on all columns in the table.
// Part 03
DISTINCT on a Single Column
The most common use of DISTINCT is finding the unique values in one column — all distinct cities, all distinct categories, all distinct statuses. This is often called finding the domain of a column (the set of values it takes); the number of distinct values is its cardinality.
Single-column DISTINCT queries are the most useful exploratory queries to run when you join a new project. Before you write any filtering queries on these columns, run DISTINCT to know exactly what values exist in the data — including any unexpected values, typos, or formatting inconsistencies that would cause your WHERE conditions to miss rows.
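A sketch of that exploratory workflow, with an invented orders table that deliberately contains a typo and inconsistent casing — exactly the kind of surprise DISTINCT profiling uncovers:

```python
import sqlite3

# Invented status column with a typo ('deliverd') and mixed casing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "delivered"), (2, "Delivered"),
                  (3, "deliverd"), (4, "pending")])

# DISTINCT reveals all four variants, so a filter like
# WHERE status = 'delivered' would silently miss two rows.
statuses = [r[0] for r in conn.execute(
    "SELECT DISTINCT status FROM orders ORDER BY status")]
print(statuses)  # ['Delivered', 'deliverd', 'delivered', 'pending']
```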
// Part 04
DISTINCT on Multiple Columns
When you list multiple columns in a DISTINCT query, the database returns each unique combination of those columns — not just unique values in each column independently. A row is a duplicate only if every column in the SELECT list has an identical value.
The combination rule in practice
Consider DISTINCT city, loyalty_tier on a customers table with 20 rows. There are 7 distinct cities and 4 distinct tiers. The number of distinct combinations is NOT 7 + 4 = 11. It is however many unique city-tier pairs actually appear in the data — some cities might have customers at all four tiers, others might only have Bronze and Silver customers. DISTINCT returns only the combinations that genuinely exist.
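The combination rule can be verified directly. This sketch (invented data, shrunk from the 20-row scenario above) shows that only pairs that genuinely occur are returned:

```python
import sqlite3

# Invented customers table: duplicate (city, tier) pairs should collapse,
# but pairs that never occur should not appear at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (city TEXT, loyalty_tier TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [
    ("Bangalore", "Gold"), ("Bangalore", "Gold"),      # duplicate pair
    ("Bangalore", "Platinum"),
    ("Hyderabad", "Silver"), ("Hyderabad", "Silver"),  # duplicate pair
])

pairs = list(conn.execute(
    "SELECT DISTINCT city, loyalty_tier FROM customers "
    "ORDER BY city, loyalty_tier"))
print(pairs)
# [('Bangalore', 'Gold'), ('Bangalore', 'Platinum'), ('Hyderabad', 'Silver')]
```

Three unique pairs come back — not 2 cities + 3 tiers, and not every possible city-tier combination.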
// Part 05
DISTINCT with WHERE and ORDER BY
DISTINCT works seamlessly with WHERE and ORDER BY. WHERE filters rows before DISTINCT processes them — so DISTINCT only sees and deduplicates the rows that passed the filter. ORDER BY sorts the unique result set after deduplication.
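A small sketch of that pipeline order — filter, then deduplicate, then sort — using an invented orders table:

```python
import sqlite3

# Invented data: which stores have taken at least one order over 100?
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (store_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 50.0), (1, 500.0), (2, 700.0), (3, 20.0), (2, 900.0)])

# WHERE runs first (3 rows survive), DISTINCT deduplicates them (2 stores),
# and ORDER BY sorts the deduplicated result.
stores = [r[0] for r in conn.execute(
    "SELECT DISTINCT store_id FROM orders "
    "WHERE amount > 100 ORDER BY store_id DESC")]
print(stores)  # [2, 1]
```

Store 3 never appears: its only order was filtered out before DISTINCT ever saw it.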
// Part 06
COUNT DISTINCT — Counting Unique Values
One of the most common analytical questions is not "what are the unique values?" but "how many unique values are there?" For this, SQL provides COUNT(DISTINCT column) — it counts the number of distinct non-null values in a column.
COUNT(*) vs COUNT(DISTINCT column) vs COUNT(column)
| Expression | What it counts | NULL handling |
|---|---|---|
| COUNT(*) | Every row, regardless of values | Counts rows even if every column is NULL |
| COUNT(column) | Rows where column is NOT NULL | NULLs are excluded from the count |
| COUNT(DISTINCT column) | Unique non-NULL values in column | NULLs are excluded from the count |
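The three counting expressions can be compared side by side. This sketch uses an invented orders table where delivery_date is NULL for undelivered orders:

```python
import sqlite3

# Invented orders table: delivery_date is NULL for undelivered orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, delivery_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (101, "2024-01-05"), (101, None), (102, "2024-01-06"), (103, None),
])

row = conn.execute("""
    SELECT COUNT(*),                     -- every row
           COUNT(delivery_date),         -- rows where delivery_date IS NOT NULL
           COUNT(DISTINCT customer_id)   -- unique non-NULL customers
    FROM orders
""").fetchone()
print(row)  # (4, 2, 3)
```

Four orders total, two delivered, three distinct customers — three different questions answered from one table.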
// Part 07
DISTINCT vs GROUP BY — When to Use Which
DISTINCT and GROUP BY both return unique combinations of values. For simple deduplication, they produce identical results. But they serve different purposes and have different capabilities.
They diverge when you want to calculate something per unique value. DISTINCT cannot do this — it only removes duplicates. GROUP BY can aggregate: count how many customers per city, sum revenue per store, find the average price per category. You will learn GROUP BY fully in Module 28, but the key distinction is simple: DISTINCT only deduplicates, while GROUP BY deduplicates and lets you compute over each group.
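A sketch of both sides of that distinction, with invented data — for bare deduplication the two forms return identical rows, but only GROUP BY can attach a per-group calculation:

```python
import sqlite3

# Invented customers table shared by both queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Asha", "Bangalore"), ("Ravi", "Bangalore"),
                  ("Meera", "Chennai")])

# For simple deduplication, DISTINCT and GROUP BY give the same result.
distinct_rows = list(conn.execute(
    "SELECT DISTINCT city FROM customers ORDER BY city"))
grouped_rows = list(conn.execute(
    "SELECT city FROM customers GROUP BY city ORDER BY city"))
print(distinct_rows == grouped_rows)  # True

# GROUP BY can go further: a count per city, which DISTINCT cannot express.
counts = list(conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"))
print(counts)  # [('Bangalore', 2), ('Chennai', 1)]
```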
Performance comparison
For simple deduplication, DISTINCT and GROUP BY have similar performance — both require the database to process all rows and identify unique combinations. GROUP BY often has a slight edge because it is more directly optimised in most database engines. For large tables, if you only need unique values with no aggregation, both are valid — but GROUP BY is preferred in professional code because it is more expressive and extensible (you can add COUNT or SUM later without rewriting the query structure).
🎯 Pro Tip
In production code, prefer GROUP BY over DISTINCT for deduplication when working with large tables — it is more explicit about intent, easier to extend with aggregations, and often slightly faster. Use DISTINCT for quick exploration and profiling, and when the query is simple enough that GROUP BY would add unnecessary verbosity.
// Part 08
DISTINCT and Performance — The Hidden Cost
DISTINCT is not free. Before returning any row, the database must process all rows in the result set and eliminate duplicates. On small tables this is imperceptible. On tables with millions of rows, DISTINCT can be significantly slower than a plain SELECT — and significantly slower than a well-designed GROUP BY with an index.
When DISTINCT is expensive
DISTINCT requires a sort or hash of the entire projected result set. If 5 million rows pass through WHERE, DISTINCT must hash or sort all 5 million before returning any. Memory usage grows with result size — large DISTINCT operations may spill to disk, causing further slowdown.
When DISTINCT is cheap
If the column being deduplicated has an index, the database can often use an index scan to find unique values without processing every row. For a column with 7 distinct values in a 10-million-row table, the database can scan just the index (far smaller than the table) and return 7 values almost instantly. For columns with few distinct values relative to total rows (low cardinality — like city, status, category), DISTINCT is fast even on large tables.
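You can see the planner's choice yourself with SQLite's EXPLAIN QUERY PLAN. This is an illustrative sketch with invented data; the exact plan text varies between SQLite versions, so treat the printed plan as informational rather than guaranteed:

```python
import sqlite3

# Invented low-cardinality column: 3 distinct cities across 400 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany("INSERT INTO customers (city) VALUES (?)",
                 [("Bangalore",), ("Chennai",), ("Bangalore",), ("Pune",)] * 100)
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")

# With an index on city, SQLite can answer DISTINCT from the index alone.
for row in conn.execute("EXPLAIN QUERY PLAN SELECT DISTINCT city FROM customers"):
    print(row)  # typically reports a scan using the covering index

cities = [r[0] for r in conn.execute(
    "SELECT DISTINCT city FROM customers ORDER BY city")]
print(cities)  # ['Bangalore', 'Chennai', 'Pune']
```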
DISTINCT as a debugging smell
Experienced SQL writers know that DISTINCT in a complex query — especially a query with JOINs — is often a sign that something else is wrong. If a JOIN is producing more rows than expected (a fan-out from a one-to-many relationship), adding DISTINCT might mask the problem rather than fix it. Before reaching for DISTINCT, ask: why are there duplicates? If the answer is "my JOIN is returning more rows than I expect," fix the JOIN rather than hiding the extra rows with DISTINCT.
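The fan-out and its proper fix can be demonstrated concretely. In this invented one-to-many example, the JOIN repeats each customer once per order; GROUP BY keeps the information that DISTINCT would throw away:

```python
import sqlite3

# Invented one-to-many schema: one customer, many orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.execute("INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi')")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(10, 1), (11, 1), (12, 1), (13, 2)])

# Fan-out: 4 result rows for only 2 customers, because Asha has 3 orders.
fanned = [r[0] for r in conn.execute(
    "SELECT name FROM customers JOIN orders USING (customer_id)")]
print(len(fanned))  # 4

# DISTINCT would collapse this to 2 rows but discard the order counts.
# GROUP BY answers the real question instead.
counts = list(conn.execute("""
    SELECT name, COUNT(order_id)
    FROM customers JOIN orders USING (customer_id)
    GROUP BY customer_id, name
    ORDER BY name
"""))
print(counts)  # [('Asha', 3), ('Ravi', 1)]
```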
// Part 09
Practical DISTINCT Patterns — Real Business Uses
These are the DISTINCT patterns you will write most frequently in real analytics work.
Schema exploration — what values exist in this column?
Reach analysis — which entities touched a segment?
Data quality checks — find unexpected values
Cardinality profiling — how many unique values?
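The four patterns above can be sketched against a single invented products table — the table, product names, and brands are made up for illustration:

```python
import sqlite3

# Invented products table, including a deliberate 'bakery'/'Bakery' casing bug.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, brand TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    ("Milk 1L", "Dairy", "Amul"), ("Butter", "Dairy", "Amul"),
    ("Bread", "Bakery", "Britannia"), ("Cookies", "bakery", "Britannia"),
])

# 1. Schema exploration + 3. data quality check: the casing bug is visible.
categories = [r[0] for r in conn.execute(
    "SELECT DISTINCT category FROM products ORDER BY category")]
print(categories)  # ['Bakery', 'Dairy', 'bakery']

# 2. Reach analysis: which brands appear in the Dairy category?
dairy_brands = [r[0] for r in conn.execute(
    "SELECT DISTINCT brand FROM products WHERE category = 'Dairy'")]
print(dairy_brands)  # ['Amul']

# 4. Cardinality profiling: how many unique brands?
n_brands = conn.execute(
    "SELECT COUNT(DISTINCT brand) FROM products").fetchone()[0]
print(n_brands)  # 2
```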
// Part 10
What This Looks Like at Work
You are a data analyst at Zepto, the quick commerce startup. The product team is preparing a feature that lets customers filter products by brand. Before the engineering team builds the filter UI, they need to know exactly which brands exist in the product catalogue — the complete, deduplicated list with no repeats.
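The query the product team needs is a one-liner. This sketch uses an invented products table and brand names — the real catalogue schema would differ:

```python
import sqlite3

# Invented product catalogue for the brand-filter feature.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, brand TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "Amul"), (2, "Amul"), (3, "Nestle"), (4, "Britannia")])

# The complete, deduplicated brand list, sorted for the filter UI.
brands = [r[0] for r in conn.execute(
    "SELECT DISTINCT brand FROM products ORDER BY brand")]
print(brands)  # ['Amul', 'Britannia', 'Nestle']
```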
🎯 Pro Tip
DISTINCT is one of the most useful tools for data profiling — understanding a new dataset before writing production queries on it. When you join a new project, spend an hour running SELECT DISTINCT on every important column in every important table. You will find unexpected values, inconsistent capitalisation, typos, deprecated statuses, and missing data — all of which affect every query you will write on that column. Discovering these issues before writing business logic saves hours of debugging later.
// Part 11
Interview Prep — 5 Questions With Complete Answers
Question 1: What does SELECT DISTINCT do, how does it differ from a plain SELECT, and how does the database implement it?
SELECT DISTINCT returns only unique rows — it eliminates duplicate rows from the result set before returning them to the caller. A plain SELECT returns every row that satisfies the WHERE condition, including duplicates. The deduplication applies to the complete combination of all columns listed in SELECT: a row is considered a duplicate only if every column in the SELECT list has an identical value to another row.
The practical difference: SELECT city FROM customers returns one row per customer — 20 rows if there are 20 customers, with cities repeated for customers in the same city. SELECT DISTINCT city FROM customers returns one row per unique city — 7 rows if customers are distributed across 7 cities, regardless of how many customers are in each.
Internally, DISTINCT requires the database to process all result rows and eliminate duplicates before returning any. This is done through sorting (duplicates become adjacent) or hashing (track seen values in a hash table). Both approaches require processing the full result set, making DISTINCT a blocking operation — it cannot return partial results early the way LIMIT can. The cost grows with the number of input rows and the number of columns in the SELECT list.
Question 2: Explain the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column).
COUNT(*) counts every row in the result set regardless of the values in any column. It includes rows where all columns are NULL. It is the correct function for counting how many rows a query returns — total orders, total customers, total products.
COUNT(column) counts rows where the specified column is NOT NULL. If delivery_date is NULL for undelivered orders, COUNT(delivery_date) counts only the delivered orders — those where delivery_date has a real value. This makes COUNT(column) useful for counting non-missing values: how many orders have been assigned a delivery date, how many employees have a specified manager.
COUNT(DISTINCT column) counts the number of unique non-null values in the specified column. COUNT(DISTINCT customer_id) from the orders table counts how many distinct customers have placed at least one order — not how many total orders, and not including customers who have never ordered. This is the cardinality question: how many unique values exist. A common analytical pattern combining all three: SELECT COUNT(*) AS total_orders, COUNT(delivery_date) AS delivered_orders, COUNT(DISTINCT customer_id) AS unique_customers FROM orders — each answers a different question about the same table.
Question 3: How does DISTINCT behave when multiple columns are listed in the SELECT?
When multiple columns are listed in a SELECT DISTINCT query, DISTINCT applies to the full combination of all listed columns — not to each column independently. A row is eliminated as a duplicate only if every column in the SELECT list has an identical value to another row. If any one column differs, the row is considered unique and is included in the result.
Concrete example: SELECT DISTINCT city, loyalty_tier FROM customers. The result contains every unique city-tier pair that exists in the data. If Bangalore has customers at Gold and Platinum tiers, two rows appear: (Bangalore, Gold) and (Bangalore, Platinum). If Hyderabad only has Silver customers, one row appears: (Hyderabad, Silver). The total number of result rows is the count of unique combinations — not the sum of distinct values in each column independently.
This combination behaviour is important to understand because it means adding more columns to a DISTINCT query increases the number of rows returned (or keeps it the same — never decreases it). If every combination of city and loyalty_tier is unique, SELECT DISTINCT city, loyalty_tier returns as many rows as SELECT DISTINCT city. Only if multiple rows share the exact same city AND loyalty_tier does DISTINCT reduce the count. The more columns you add, the more specific the combination and the fewer rows get eliminated as duplicates.
Question 4: When should you use DISTINCT, and when should you use GROUP BY?
Both DISTINCT and GROUP BY can return unique combinations of column values — for simple deduplication they produce identical results. The choice depends on whether you need to calculate anything per unique combination.
Use DISTINCT when you only need the unique values themselves with no aggregation: SELECT DISTINCT city FROM customers. It is concise and communicates intent clearly — you want the unique set of cities, nothing more. DISTINCT is also the appropriate choice in COUNT(DISTINCT column) expressions inside aggregate queries.
Use GROUP BY when you need unique values plus any calculation per group: SELECT city, COUNT(*) AS customer_count FROM customers GROUP BY city. DISTINCT cannot perform this — it only eliminates duplicates, it does not aggregate. GROUP BY is also preferred in production code for large tables because it is more directly optimised in most database engines and is more extensible — you can add SUM, AVG, or MAX columns without changing the query structure. A practical rule: if the query only has DISTINCT with no aggregate functions, GROUP BY is an equally valid and often preferable alternative. If you need aggregation per group, GROUP BY is the only option.
Question 5: Why is DISTINCT in a JOIN query often considered a warning sign, and what should you do instead?
DISTINCT in a JOIN query is a warning sign because it often indicates that the JOIN is producing more rows than intended — and DISTINCT is being used to hide the problem rather than fix it. The most common cause is a one-to-many JOIN that fans out rows: if you join customers to orders on customer_id, and a customer has 5 orders, that customer's row appears 5 times in the result. Adding DISTINCT collapses those 5 rows back to 1 — but you have also lost the information that there were 5 orders, and you are paying the cost of both the fan-out and the deduplication.
The correct fix depends on what you actually want. If you want one row per customer with an order count, use GROUP BY: SELECT customer_id, COUNT(order_id) AS order_count FROM customers JOIN orders USING (customer_id) GROUP BY customer_id. If you want customers who have placed at least one order (existence check), use EXISTS: SELECT customer_id FROM customers WHERE EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = customers.customer_id). Both are more correct and more efficient than joining and then applying DISTINCT.
The general principle: when you find yourself adding DISTINCT to remove unexpected duplicates, stop and investigate why the duplicates exist. The answer is almost always a JOIN issue — wrong join column, missing join condition creating a cartesian product, or a one-to-many relationship producing more rows than expected. Fixing the root cause gives you correct results with better performance. DISTINCT on top of a broken JOIN gives you correct-looking results that hide a performance problem and a misunderstood data model.
// Part 12
Errors You Will Hit — And Exactly Why They Happen
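One classic surprise involves NULL. Across SQL engines, DISTINCT treats NULLs as duplicates of each other, so multiple NULLs collapse into a single result row — while COUNT(DISTINCT ...) skips NULLs entirely, so the two can disagree about how many "values" a column has. A sketch with invented data, run in SQLite:

```python
import sqlite3

# Invented referral_code column where missing codes are stored as NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (referral_code TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)",
                 [("FRIEND10",), (None,), (None,), ("FRIEND10",)])

# DISTINCT collapses both NULLs into one row: 2 rows come back, not 3.
rows = list(conn.execute("SELECT DISTINCT referral_code FROM customers"))
print(len(rows))  # 2 -- one NULL row plus one 'FRIEND10' row

# COUNT(DISTINCT ...) ignores NULL entirely: it reports 1, not 2.
count = conn.execute(
    "SELECT COUNT(DISTINCT referral_code) FROM customers").fetchone()[0]
print(count)  # 1
```

If a report needs NULL counted as its own category, wrap the column in COALESCE with a sentinel value before deduplicating or counting.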
🎯 Key Takeaways
- ✓SELECT DISTINCT eliminates duplicate rows from the result. Each unique combination of the listed columns appears exactly once.
- ✓DISTINCT applies to the full combination of ALL columns in SELECT — not to individual columns independently. Adding more columns to SELECT DISTINCT increases or maintains the result count, never decreases it.
- ✓DISTINCT requires processing all result rows before returning any. It cannot return partial results early. On large tables this is expensive — always check whether an index covers the DISTINCT columns.
- ✓COUNT(DISTINCT column) counts unique non-null values in a column. COUNT(*) counts all rows. COUNT(column) counts non-null values. All three answer different questions.
- ✓DISTINCT and GROUP BY produce the same result for simple deduplication. Use DISTINCT for quick deduplication with no aggregation. Use GROUP BY when you need unique values plus any calculation (COUNT, SUM, AVG) per group.
- ✓DISTINCT in a JOIN query is a warning sign — it usually means the JOIN is producing unexpected duplicates. Investigate and fix the JOIN rather than hiding duplicates with DISTINCT.
- ✓The most valuable use of DISTINCT in professional work is schema exploration: run SELECT DISTINCT on key columns in a new table to discover all existing values, including unexpected typos, casing inconsistencies, and deprecated statuses.
- ✓Low-cardinality columns (few distinct values like status, category, tier) are cheap to DISTINCT on. High-cardinality columns (many distinct values like email, order_id) are expensive without an index.
- ✓DISTINCT treats NULLs as duplicates of one another — multiple NULLs in a column collapse into a single NULL row in the result. COUNT(DISTINCT column), by contrast, excludes NULLs entirely; if you want NULL counted as a value, use COALESCE to replace it with a sentinel value first.
- ✓Before writing any WHERE condition that filters on a column you are unfamiliar with, run SELECT DISTINCT column FROM table first — this reveals all actual values and prevents WHERE conditions that silently match nothing due to unexpected formatting.
What comes next
In Module 11, you master NULL values completely — what NULL means, why it behaves differently from every other value, how it propagates through calculations and comparisons, and every technique for handling it correctly in your queries.
Module 11 → Working with NULL Values