SQL — Module 41Intermediate

String Functions

Every text manipulation tool — concatenation, case, length, trimming, substrings, replacement, padding, splitting, and pattern matching for real data cleaning

14–18 min April 2026

Section 9 · SQL Functions

SQL Functions · 4 modulesModule 41

String Date Math Type

// Part 01

Why String Functions Matter in Real Data

In an ideal world, every string column is clean, consistently formatted, and ready to query. In production, data arrives from mobile apps, web forms, CSV imports, partner APIs, and legacy systems — each with its own formatting quirks. Customer names with extra spaces, email addresses in mixed case, phone numbers formatted a dozen different ways, product codes with inconsistent separators.

String functions are the tools that standardise, clean, extract, and transform text data at query time — without modifying the underlying rows. They are essential for data quality checks, display formatting, join key normalisation (so 'bangalore' matches 'Seattle'), and extracting structured information from unstructured text fields.

// Part 02

Concatenation — Joining Strings Together

The || operator and CONCAT()

Concatenation syntax — two approaches

-- PostgreSQL and DuckDB: || operator (SQL standard)
'Hello' || ' ' || 'World'     -- result: 'Hello World'

-- MySQL and most databases: CONCAT() function
CONCAT('Hello', ' ', 'World') -- result: 'Hello World'

-- Both work in DuckDB (this playground)
-- NULL propagation: NULL || 'anything' = NULL
-- CONCAT handles NULLs differently in MySQL (treats as empty string)

-- Safe concatenation with COALESCE:
COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')

Loading FreshCart DB…
-- Build full names, store labels, and email displays
SELECT
  first_name || ' ' || last_name                    AS full_name,
  UPPER(LEFT(first_name, 1)) || '. ' || last_name   AS abbreviated,
  city || ', ' || state                             AS location,
  first_name || ' <' || email || '>'               AS email_display
FROM customers
ORDER BY last_name, first_name
LIMIT 8;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- CONCAT_WS: concatenate with separator — skips NULLs automatically
-- (Concat With Separator)
SELECT
  store_id,
  store_name,
  -- Build a full address, skipping NULL parts
  CONCAT_WS(', ', store_name, city, state)   AS full_address
FROM stores
ORDER BY store_id;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 03

Case Functions — UPPER, LOWER, INITCAP

Loading FreshCart DB…
-- Case normalisation — essential for joins and comparisons
SELECT
  product_name,
  UPPER(product_name)       AS upper_name,
  LOWER(product_name)       AS lower_name,
  LOWER(brand)              AS normalised_brand,
  -- INITCAP: Title Case (first letter of each word capitalised)
  INITCAP(LOWER(product_name)) AS title_case
FROM products
ORDER BY category
LIMIT 10;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Case-insensitive join using LOWER()
-- Without LOWER: 'Seattle' != 'bangalore' — join misses rows
-- With LOWER: both normalised, join works correctly
SELECT
  c.customer_id,
  c.first_name,
  c.city,
  COUNT(o.order_id)  AS order_count
FROM customers AS c
JOIN orders AS o ON c.customer_id = o.customer_id
WHERE LOWER(c.city) IN ('bangalore', 'hyderabad', 'mumbai')
GROUP BY c.customer_id, c.first_name, c.city
ORDER BY order_count DESC;Ctrl + Enter to run
Loading FreshCart database in your browser…

🎯 Pro Tip

Always normalise case before comparing or joining on string columns. WHERE LOWER(city) = 'bangalore' correctly matches 'Seattle', 'BANGALORE', and 'bangalore'. Without normalisation, a single inconsistency in the data breaks the match silently — no error, just missing rows.

// Part 04

Length Functions — Measuring Strings

Loading FreshCart DB…
-- LENGTH / CHAR_LENGTH: number of characters
-- OCTET_LENGTH: number of bytes (differs for multi-byte UTF-8)
SELECT
  product_name,
  LENGTH(product_name)        AS char_length,
  LENGTH(TRIM(product_name))  AS trimmed_length,
  brand,
  LENGTH(brand)               AS brand_length
FROM products
ORDER BY char_length DESC
LIMIT 8;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Data quality: find suspiciously short or long values
SELECT
  customer_id,
  first_name,
  LENGTH(first_name)   AS name_len,
  email,
  LENGTH(email)        AS email_len
FROM customers
WHERE LENGTH(first_name) < 2        -- suspiciously short names
   OR LENGTH(email) < 6             -- suspiciously short emails
   OR email NOT LIKE '%@%'          -- missing @ — invalid email
ORDER BY name_len;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 05

Trimming — Removing Whitespace and Characters

Leading and trailing whitespace is one of the most common data quality issues — especially in data imported from CSV files or form submissions. TRIM removes it cleanly. Untrimmed strings cause join mismatches and incorrect comparisons silently.

TRIM variants

-- TRIM: removes leading AND trailing whitespace (default)
TRIM('  hello world  ')           -- 'hello world'

-- LTRIM: removes leading whitespace only
LTRIM('  hello world  ')          -- 'hello world  '

-- RTRIM: removes trailing whitespace only
RTRIM('  hello world  ')          -- '  hello world'

-- TRIM with specific character (remove specific chars not just spaces)
TRIM('.' FROM '...hello...')      -- 'hello'
TRIM(BOTH '0' FROM '00042000')    -- '42'   (removes leading/trailing zeros)
TRIM(LEADING '0' FROM '00042000') -- '42000' (removes only leading zeros)

-- Practical: normalise before joining
WHERE LOWER(TRIM(city)) = 'bangalore'

Loading FreshCart DB…
-- Demonstrate TRIM in action
SELECT
  '  Amul Butter  '                       AS raw_value,
  TRIM('  Amul Butter  ')                 AS trimmed,
  LENGTH('  Amul Butter  ')              AS raw_length,
  LENGTH(TRIM('  Amul Butter  '))         AS trimmed_length,
  -- Practical: trim and normalise product names
  INITCAP(LOWER(TRIM(product_name)))      AS clean_name,
  LENGTH(product_name) - LENGTH(TRIM(product_name)) AS whitespace_count
FROM products
LIMIT 6;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Data quality audit: find values with leading/trailing whitespace
SELECT
  customer_id,
  first_name,
  '>' || first_name || '<'                AS raw_with_markers,
  LENGTH(first_name) - LENGTH(TRIM(first_name)) AS extra_whitespace
FROM customers
WHERE first_name != TRIM(first_name)      -- has leading or trailing spaces
   OR last_name  != TRIM(last_name)
ORDER BY customer_id;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 06

Substring Functions — Extracting Parts of Strings

SUBSTRING / SUBSTR→ portion of string

SUBSTRING(str, start_pos, length) -- 1-indexed

Extract a portion starting at start_pos for length characters. Negative start not supported in standard SQL.

LEFT→ first N characters

LEFT(str, n)

Returns the first n characters. LEFT('Seattle', 3) = 'Ban'

RIGHT→ last N characters

RIGHT(str, n)

Returns the last n characters. RIGHT('ST001', 3) = '001'

SPLIT_PART→ Nth segment after splitting

SPLIT_PART(str, delimiter, n) -- 1-indexed

Splits by delimiter and returns the Nth segment. SPLIT_PART('a,b,c', ',', 2) = 'b'

Loading FreshCart DB…
-- Extracting parts of store IDs and product codes
SELECT
  store_id,
  LEFT(store_id, 2)                         AS prefix,
  RIGHT(store_id, 3)                        AS store_number,
  SUBSTRING(store_id, 3)                    AS number_part,
  -- Extract first letter of each word for initials
  LEFT(store_name, 1)                       AS first_letter,
  city
FROM stores
ORDER BY store_id;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- SPLIT_PART: extract parts of compound strings
SELECT
  email,
  SPLIT_PART(email, '@', 1)              AS username,
  SPLIT_PART(email, '@', 2)              AS domain,
  SPLIT_PART(SPLIT_PART(email, '@', 2), '.', 1) AS domain_name
FROM customers
ORDER BY domain
LIMIT 8;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Build customer initials and short codes
SELECT
  customer_id,
  first_name,
  last_name,
  LEFT(first_name, 1) || LEFT(last_name, 1)  AS initials,
  -- Short code: first 3 chars of last name + customer_id
  UPPER(LEFT(last_name, 3)) || customer_id::TEXT  AS short_code
FROM customers
ORDER BY last_name
LIMIT 8;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 07

String Search — POSITION, STRPOS, CONTAINS

Finding positions within strings

-- POSITION: find where a substring first appears (1-indexed, 0 = not found)
POSITION('@' IN 'user@gmail.com')          -- 5
POSITION('xyz' IN 'user@gmail.com')        -- 0 (not found)

-- STRPOS (PostgreSQL/DuckDB): same as POSITION, different syntax
STRPOS('user@gmail.com', '@')              -- 5

-- CONTAINS (DuckDB): returns TRUE/FALSE
CONTAINS('Seattle India', 'India')       -- true

-- Practical: extract everything before a delimiter
SUBSTRING(email, 1, POSITION('@' IN email) - 1)  -- username before @
SUBSTRING(email, POSITION('@' IN email) + 1)      -- domain after @

Loading FreshCart DB…
-- Use POSITION to extract email parts without SPLIT_PART
SELECT
  email,
  -- Username: everything before @
  SUBSTRING(email, 1, POSITION('@' IN email) - 1)   AS username,
  -- Domain: everything after @
  SUBSTRING(email, POSITION('@' IN email) + 1)       AS domain,
  -- Position of the @ sign
  POSITION('@' IN email)                              AS at_position
FROM customers
WHERE POSITION('@' IN email) > 0   -- only valid emails with @
ORDER BY domain
LIMIT 8;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 08

REPLACE and TRANSLATE — Substituting Characters

Loading FreshCart DB…
-- REPLACE: substitute all occurrences of a substring
SELECT
  product_name,
  REPLACE(product_name, ' ', '_')           AS underscored,
  REPLACE(LOWER(product_name), ' ', '-')    AS url_slug,
  -- Remove specific characters
  REPLACE(REPLACE(phone, ' ', ''), '-', '') AS cleaned_phone
FROM products
LIMIT 6;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Chaining REPLACE for data cleaning
SELECT
  email,
  -- Normalise: remove spaces, lowercase
  LOWER(REPLACE(email, ' ', ''))            AS clean_email,
  phone,
  -- Clean phone: remove spaces, dashes, parentheses
  REPLACE(REPLACE(REPLACE(REPLACE(
    phone, ' ', ''), '-', ''), '(', ''), ')', ''
  )                                         AS clean_phone
FROM customers
LIMIT 8;Ctrl + Enter to run
Loading FreshCart database in your browser…

TRANSLATE — character-by-character substitution

-- TRANSLATE: replace each character in the FROM set with the corresponding
-- character in the TO set — one-to-one character mapping
TRANSLATE('Hello World', 'aeiou', '12345')  -- 'H2ll4 W4rld'
-- H→H, e→2, l→l, l→l, o→4, ' '→' ', W→W, o→4, r→r, l→l, d→d

-- Remove specific characters (TO is shorter than FROM — excess chars are deleted)
TRANSLATE('ST-001/A', '-/', '')             -- 'ST001A' (removes - and /)

-- Practical: clean phone numbers (remove all non-digit characters)
TRANSLATE(phone, '()-+ ', '')              -- removes (), -, +, space
-- Much cleaner than chaining multiple REPLACE calls

Loading FreshCart DB…
-- TRANSLATE for efficient multi-character cleaning
SELECT
  store_id,
  -- Remove all vowels from store names (illustrative)
  TRANSLATE(LOWER(store_name), 'aeiou', '')    AS consonants_only,
  -- Normalise store codes: remove non-alphanumeric
  TRANSLATE(store_id, '-/', '')                AS clean_code,
  city
FROM stores
ORDER BY store_id;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 09

Padding — LPAD and RPAD

Padding adds characters to reach a target length — essential for generating fixed-width codes, formatting report columns for alignment, and generating zero-padded IDs.

Loading FreshCart DB…
-- LPAD: pad on the left. RPAD: pad on the right
SELECT
  customer_id,
  -- Zero-padded ID: ensure 6 digits
  LPAD(customer_id::TEXT, 6, '0')           AS padded_id,
  first_name,
  -- Right-pad name for fixed-width report columns
  RPAD(first_name || ' ' || last_name, 25, ' ') || ' |' AS fixed_width_name,
  -- Left-pad order amounts with spaces for alignment
  LPAD(ROUND(total_amount::NUMERIC, 2)::TEXT, 10, ' ') AS aligned_amount
FROM customers AS c
LEFT JOIN (
  SELECT customer_id, MAX(total_amount) AS total_amount
  FROM orders WHERE order_status = 'Delivered'
  GROUP BY customer_id
) AS o ON c.customer_id = o.customer_id
ORDER BY c.customer_id
LIMIT 8;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Generate fixed-format store codes and report headers
SELECT
  store_id,
  -- Fixed-width store display
  '[' || RPAD(store_id, 6) || '] ' || RPAD(city, 12) || RPAD(store_name, 25) AS report_row
FROM stores
ORDER BY store_id;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 10

LIKE and Pattern Matching — Flexible String Filtering

LIKE with wildcards is the standard SQL pattern matching tool. % matches any sequence of characters (including none). _ matches exactly one character. For more powerful patterns, SIMILAR TO (PostgreSQL) and regular expressions provide regex capabilities.

LIKE wildcards

-- % = zero or more characters
-- _ = exactly one character

'Amul%'        -- starts with 'Amul'
'%Butter'      -- ends with 'Butter'
'%Milk%'       -- contains 'Milk' anywhere
'_mail.com'    -- any single char then 'mail.com'
'ST__1'        -- ST, any 2 chars, then 1

-- Case sensitivity: LIKE is case-sensitive in PostgreSQL
-- Use ILIKE for case-insensitive matching (PostgreSQL/DuckDB)
WHERE product_name ILIKE '%amul%'      -- matches 'Amul', 'AMUL', 'amul'
WHERE product_name LIKE '%Amul%'       -- matches only 'Amul' (case-sensitive)

-- Escape literal % or _: use ESCAPE clause
WHERE notes LIKE '%50%%' ESCAPE ''   -- contains literal '50%'

Loading FreshCart DB…
-- LIKE pattern filtering on FreshCart data
SELECT
  product_name,
  category,
  brand,
  unit_price
FROM products
WHERE product_name LIKE '%Milk%'          -- contains 'Milk'
   OR product_name LIKE 'Amul%'           -- starts with 'Amul'
   OR brand        LIKE '%ata%'           -- brand contains 'ata' (Tata)
ORDER BY brand, product_name;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- ILIKE: case-insensitive LIKE (PostgreSQL / DuckDB)
SELECT
  customer_id,
  first_name,
  last_name,
  email
FROM customers
WHERE email ILIKE '%gmail%'             -- any case gmail addresses
   OR first_name ILIKE 'a%'            -- names starting with A (any case)
ORDER BY first_name;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Find stores in cities matching a pattern
SELECT
  store_id,
  store_name,
  city
FROM stores
WHERE city LIKE '%a%'                  -- cities containing 'a'
  AND store_id LIKE 'ST0%'            -- standard store codes
ORDER BY city;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 11

Regular Expressions — Powerful Pattern Matching

Regular expressions (regex) provide full pattern matching power — validating formats, extracting structured data from free text, and finding complex patterns. PostgreSQL uses ~ for regex match and ~* for case-insensitive match. DuckDB uses REGEXP_MATCHES.

Regex operators and functions

-- PostgreSQL / DuckDB regex:
-- ~   : matches regex (case-sensitive)
-- ~*  : matches regex (case-insensitive)
-- !~  : does NOT match
-- !~* : does NOT match (case-insensitive)

-- Validate email format (simplified)
WHERE email ~ '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}$'

-- Validate Indian mobile number: starts with 6-9, 10 digits total
WHERE phone ~ '^[6-9][0-9]{9}$'

-- Validate 6-digit zip_code
WHERE zip_code ~ '^[0-9]{6}$'

-- Extract with REGEXP_EXTRACT (DuckDB)
REGEXP_EXTRACT(text, pattern, group)

-- Replace with REGEXP_REPLACE
REGEXP_REPLACE(string, pattern, replacement)
-- Replace all non-digit characters in phone:
REGEXP_REPLACE(phone, '[^0-9]', '', 'g')  -- 'g' = global (all occurrences)

Loading FreshCart DB…
-- Regex-based data validation
SELECT
  customer_id,
  email,
  -- Valid email: has exactly one @, domain has a dot
  CASE WHEN email ~ '^[^@]+@[^@]+.[^@]+$'
       THEN '✓ Valid' ELSE '✗ Invalid' END AS email_status,
  phone,
  -- Valid Indian mobile: 10 digits starting with 6-9
  CASE WHEN REPLACE(REPLACE(phone, ' ', ''), '-', '')
            ~ '^[6-9][0-9]{9}$'
       THEN '✓ Valid' ELSE '✗ Check' END   AS phone_status
FROM customers
ORDER BY customer_id
LIMIT 8;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- REGEXP_REPLACE: clean phone numbers using regex
-- Remove everything that is not a digit
SELECT
  customer_id,
  phone                                                AS raw_phone,
  REGEXP_REPLACE(COALESCE(phone, ''), '[^0-9]', '', 'g') AS digits_only,
  LENGTH(REGEXP_REPLACE(COALESCE(phone, ''), '[^0-9]', '', 'g')) AS digit_count
FROM customers
ORDER BY customer_id
LIMIT 8;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 12

Aggregating Strings — STRING_AGG and ARRAY_AGG

Standard aggregates collapse values to a single number. STRING_AGG collapses multiple string values into a single concatenated string — essential for building comma-separated lists, generating labels, and creating readable summaries.

Loading FreshCart DB…
-- STRING_AGG: combine multiple values into one string
-- Syntax: STRING_AGG(column, separator ORDER BY col)
SELECT
  o.order_id,
  o.total_amount,
  -- List all products in this order as a comma-separated string
  STRING_AGG(p.product_name, ', ' ORDER BY p.product_name) AS products_in_order,
  COUNT(oi.item_id)                                        AS item_count
FROM orders AS o
JOIN order_items AS oi ON o.order_id   = oi.order_id
JOIN products    AS p  ON oi.product_id = p.product_id
WHERE o.order_status = 'Delivered'
GROUP BY o.order_id, o.total_amount
ORDER BY o.total_amount DESC
LIMIT 6;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Customer-level product preferences
SELECT
  c.customer_id,
  c.first_name || ' ' || c.last_name                            AS customer,
  c.loyalty_tier,
  STRING_AGG(DISTINCT p.category, ' | ' ORDER BY p.category)   AS categories_ordered,
  COUNT(DISTINCT p.category)                                    AS category_count
FROM customers AS c
JOIN orders      AS o  ON c.customer_id  = o.customer_id
JOIN order_items AS oi ON o.order_id     = oi.order_id
JOIN products    AS p  ON oi.product_id  = p.product_id
WHERE o.order_status = 'Delivered'
GROUP BY c.customer_id, c.first_name, c.last_name, c.loyalty_tier
ORDER BY category_count DESC
LIMIT 8;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 13

Real-World Data Cleaning Pipeline

In practice, string functions are combined into cleaning pipelines — multiple transformations chained together to take raw dirty data and produce standardised output ready for joining, reporting, or loading into a clean table.

Loading FreshCart DB…
-- Complete data cleaning pipeline for customer records
SELECT
  customer_id,

  -- Name: trim whitespace, title case
  INITCAP(LOWER(TRIM(first_name)))                 AS clean_first_name,
  INITCAP(LOWER(TRIM(last_name)))                  AS clean_last_name,

  -- Full name formatted consistently
  INITCAP(LOWER(TRIM(first_name))) || ' '
  || INITCAP(LOWER(TRIM(last_name)))               AS clean_full_name,

  -- Email: lowercase, trim
  LOWER(TRIM(email))                               AS clean_email,

  -- City: title case, trim
  INITCAP(LOWER(TRIM(city)))                       AS clean_city,

  -- Generate a URL-safe customer slug
  LOWER(
    REPLACE(
      TRIM(first_name) || '-' || TRIM(last_name),
      ' ', '-'
    )
  ) || '-' || customer_id::TEXT                    AS customer_slug,

  -- Zero-padded customer ID for display
  'CUST-' || LPAD(customer_id::TEXT, 6, '0')       AS display_id

FROM customers
ORDER BY customer_id
LIMIT 8;Ctrl + Enter to run
Loading FreshCart database in your browser…

Loading FreshCart DB…
-- Product name standardisation and code generation
SELECT
  product_id,
  product_name,

  -- Clean and standardise
  INITCAP(LOWER(TRIM(product_name)))               AS standard_name,

  -- URL slug for product pages
  LOWER(REGEXP_REPLACE(TRIM(product_name), '[^a-zA-Z0-9]', '-', 'g')) AS url_slug,

  -- Product code: category prefix + zero-padded ID
  UPPER(LEFT(category, 3)) || '-' || LPAD(product_id::TEXT, 4, '0') AS product_code,

  -- Short description: first 30 chars of name + category
  LEFT(product_name, 30) ||
  CASE WHEN LENGTH(product_name) > 30 THEN '…' ELSE '' END AS short_name,

  category,
  brand

FROM products
ORDER BY category, product_name
LIMIT 10;Ctrl + Enter to run
Loading FreshCart database in your browser…

// Part 14

What This Looks Like at Work

You are a data engineer at Shopify. A partner has delivered a CSV of 50,000 seller records — names are in inconsistent case, phone numbers have various formats, city names have typos and whitespace, and emails are uncleaned. Before loading into the sellers table, you write a cleaning query that standardises every string field.

9:00 AM

Raw data sample arrives

Sample shows: 'RAHUL sharma', ' New York ', 'rahul@GMAIL.COM', '91-98765-43210' — every field needs cleaning.

9:20 AM

Build the cleaning pipeline

Chain string functions to produce a clean output SELECT that can be wrapped in INSERT INTO sellers.

Loading FreshCart DB…
-- Simulated seller data cleaning pipeline
-- Using FreshCart employees as stand-in for seller records
SELECT
  employee_id                                        AS seller_id,

  -- Name cleaning
  INITCAP(LOWER(TRIM(first_name)))                   AS clean_first_name,
  INITCAP(LOWER(TRIM(last_name)))                    AS clean_last_name,
  INITCAP(LOWER(TRIM(first_name))) || ' '
    || INITCAP(LOWER(TRIM(last_name)))               AS full_name,

  -- Role standardisation
  INITCAP(LOWER(TRIM(role)))                         AS clean_role,
  UPPER(TRIM(department))                            AS clean_department,

  -- Generate seller handle: firstname.lastname (lowercase, no spaces)
  LOWER(TRIM(first_name)) || '.'
    || LOWER(REPLACE(TRIM(last_name), ' ', ''))      AS seller_handle,

  -- Display ID with prefix
  'SELL-' || LPAD(employee_id::TEXT, 5, '0')         AS display_id,

  -- Salary formatted with currency symbol
  '₹' || TO_CHAR(salary, 'FM999,999')                AS formatted_salary

FROM employees
ORDER BY department, last_name
LIMIT 10;Ctrl + Enter to run
Loading FreshCart database in your browser…

10:00 AM

Cleaning query reviewed and approved

The cleaning pipeline standardises all string fields consistently. The INSERT INTO sellers SELECT ... wraps this query and loads 50,000 clean records in under 3 seconds. Post-load validation confirms zero records with invalid email format or untrimmed whitespace.

🎯 Pro Tip

Always create a cleaning SELECT before writing the INSERT. Run the SELECT, spot-check 20 rows manually, and verify edge cases (names with special characters, NULL values, extremely short or long values). Only wrap in INSERT once the SELECT output looks correct. Cleaning bugs caught in SELECT cost nothing. Cleaning bugs discovered after INSERT cost a DELETE, a fix, and a re-insert.

// Part 15

Interview Prep — 5 Questions With Complete Answers

Q: How do you perform a case-insensitive string comparison in SQL?

Three approaches. First, normalise both sides using LOWER() or UPPER() before comparing: WHERE LOWER(city) = LOWER('bangalore'). This converts both values to the same case before the comparison, making it case-insensitive. Second, in PostgreSQL and DuckDB, use ILIKE instead of LIKE: WHERE city ILIKE 'bangalore' — ILIKE is the case-insensitive version of LIKE and supports wildcards. Third, for joins on string columns, normalise the join key: JOIN stores ON LOWER(TRIM(o.city)) = LOWER(TRIM(s.city)).

The most important practical application is join key normalisation. Two tables may store the same city as 'Seattle' and 'bangalore' — an INNER JOIN without normalisation would miss the match entirely. Using LOWER(TRIM(city)) on both sides ensures the join finds matches regardless of case or whitespace differences.

Performance consideration: applying LOWER() or TRIM() to a column in WHERE prevents the database from using an index on that column — because the index stores the original values, not the lowercased versions. For high-frequency case-insensitive searches on large tables, create a functional index: CREATE INDEX idx_customers_city_lower ON customers (LOWER(city)). Then WHERE LOWER(city) = 'bangalore' can use this index efficiently. Alternatively, enforce case normalisation at write time (store all city values in title case) so queries never need to apply LOWER().

Q: What is the difference between REPLACE and TRANSLATE?

REPLACE substitutes all occurrences of a specific substring with a replacement string: REPLACE(string, find_string, replace_string). It works on substrings of any length. REPLACE('Hello World', 'World', 'SQL') returns 'Hello SQL'. REPLACE can only handle one substitution pattern per call — to substitute multiple different patterns requires chaining multiple REPLACE calls.

TRANSLATE performs character-by-character substitution — each character in the FROM set is replaced by the corresponding character in the TO set: TRANSLATE(string, from_chars, to_chars). It processes the entire character-by-character mapping in a single pass. TRANSLATE('aeiou', 'aeiou', '12345') replaces a→1, e→2, i→3, o→4, u→5 in one call. If the TO set is shorter than the FROM set, characters with no corresponding TO character are deleted.

The practical choice: use REPLACE when substituting specific substrings or words — REPLACE(text, 'Ltd', 'Limited'). Use TRANSLATE when you need to substitute or remove many individual characters in one pass — TRANSLATE(phone, '()-+ ', '') removes five different characters in a single function call instead of five nested REPLACE calls. For phone number cleaning, TRANSLATE is significantly cleaner than REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(phone, '(', ''), ')', ''), '-', ''), '+', ''), ' ', '')).

Q: How would you extract the domain from an email address using SQL string functions?

Two approaches. Using SPLIT_PART: SPLIT_PART(email, '@', 2) splits by '@' and returns the second segment — the domain. For 'user@gmail.com' this returns 'gmail.com'. SPLIT_PART is the cleanest approach when the delimiter is a single consistent character.

Using SUBSTRING and POSITION: SUBSTRING(email, POSITION('@' IN email) + 1) starts extraction from one character after the '@' and takes everything to the end. POSITION('@' IN 'user@gmail.com') returns 5, so SUBSTRING starts from position 6, returning 'gmail.com'. This approach is more verbose but is portable across databases that may not support SPLIT_PART.

To further extract just the domain name without the TLD: SPLIT_PART(SPLIT_PART(email, '@', 2), '.', 1) extracts 'gmail' from 'gmail.com'. To validate that the email has an @ before extracting: WHERE POSITION('@' IN email) > 0 ensures only valid emails are processed. To handle NULLs: SPLIT_PART(COALESCE(email, ''), '@', 2) returns empty string for NULL emails rather than NULL. In production data cleaning, always validate the email format first with LIKE '%@%.%' or a regex before attempting extraction — malformed emails can cause unexpected results.

Q: What does STRING_AGG do and how is it different from GROUP_CONCAT in MySQL?

STRING_AGG is an aggregate function that concatenates string values from multiple rows into a single string, with a specified separator between each value. It is the PostgreSQL and DuckDB syntax. STRING_AGG(product_name, ', ' ORDER BY product_name) collects all product_name values in the group, sorts them alphabetically, and joins them with ', '. The result is one string per group — for a GROUP BY store_id, each store gets one string listing all its products.

GROUP_CONCAT is the MySQL equivalent: GROUP_CONCAT(product_name ORDER BY product_name SEPARATOR ', '). The functionality is identical — both aggregate string values from multiple rows into one concatenated string with a separator. The syntax differs: STRING_AGG uses standard SQL aggregate function syntax with the separator as the second argument; GROUP_CONCAT uses MySQL's non-standard keyword syntax with SEPARATOR.

Important differences: STRING_AGG returns NULL when there are no rows in the group (consistent with SUM/AVG behaviour). GROUP_CONCAT returns NULL in the same case. STRING_AGG in PostgreSQL has a strict type requirement — the column must be text or castable to text. STRING_AGG supports ORDER BY inside the function to control the concatenation order. Both support DISTINCT to deduplicate values before aggregating: STRING_AGG(DISTINCT category, ', ') produces a deduplicated list. Both have length limits (PostgreSQL default 1GB; MySQL default 1024 bytes configurable with group_concat_max_len). For large result sets, consider whether aggregating to a string is the right approach or whether a separate rows-based result would be cleaner.

Q: How do you handle NULL values in string operations?

NULL propagates through most string operations — any expression involving NULL produces NULL. 'Hello' || NULL || 'World' = NULL (not 'HelloWorld'). LENGTH(NULL) = NULL. LOWER(NULL) = NULL. TRIM(NULL) = NULL. This is the correct SQL behaviour but it can cause surprising results when any input column contains NULLs.

COALESCE is the primary tool for handling NULLs in string operations: COALESCE(column, '') replaces NULL with an empty string, allowing concatenation to proceed. first_name || ' ' || COALESCE(middle_name, '') || ' ' || last_name builds a full name even when middle_name is NULL. COALESCE(city, 'Unknown') provides a display label for NULL city values.

CONCAT() in MySQL treats NULL as empty string rather than propagating it — CONCAT('Hello', NULL, 'World') = 'HelloWorld' in MySQL, unlike PostgreSQL's || which would return NULL. This makes CONCAT() safer for nullable columns in MySQL but masks potential data issues. CONCAT_WS (Concatenate With Separator) also skips NULL values entirely — CONCAT_WS(', ', city, state, country) with a NULL city produces 'state, country' rather than ', state, country'. For regex and pattern matching functions, NULL inputs always return NULL — wrap nullable columns with COALESCE before passing to REGEXP_REPLACE or REGEXP_EXTRACT. The consistent rule: use COALESCE to substitute a meaningful default (empty string, 'Unknown', 'N/A') before any string operation on a nullable column.

// Part 16

Errors You Will Hit — And Exactly Why They Happen

String comparison fails to match — 'Seattle' != 'bangalore' — joins miss rows

Cause: SQL string comparisons are case-sensitive in PostgreSQL and most databases. 'Seattle' and 'bangalore' are different values. A JOIN or WHERE using direct equality will miss rows where the case differs between the two columns being compared. This is especially common after bulk imports where different sources used different conventions.

Fix: Normalise both sides with LOWER() before comparing: WHERE LOWER(city) = 'bangalore' or JOIN ON LOWER(a.city) = LOWER(b.city). For PostgreSQL, ILIKE is the case-insensitive alternative to LIKE. Long-term fix: enforce consistent case at write time using CHECK constraints or triggers, so all city values are always title case. For frequent queries, create a functional index on LOWER(city) to maintain performance while using case-insensitive comparisons.

String concatenation returns NULL — full_name is NULL even though first_name and last_name are not

Cause: The || operator (and CONCAT in PostgreSQL) returns NULL if any operand is NULL. If middle_name is NULL, first_name || ' ' || middle_name || ' ' || last_name = NULL — the NULL in the middle propagates to the entire expression. Any NULL in the concatenation chain nullifies the whole result.

Fix: Wrap nullable columns with COALESCE: COALESCE(middle_name, '') or use CONCAT_WS which skips NULL values. For a full name with optional middle name: first_name || CASE WHEN middle_name IS NOT NULL THEN ' ' || middle_name ELSE '' END || ' ' || last_name. Or use CONCAT_WS(' ', first_name, middle_name, last_name) — CONCAT_WS skips NULLs and only inserts the separator between non-NULL values.

LIKE pattern not matching as expected — wildcard seems to be ignored

Cause: Three common causes: (1) Case sensitivity — LIKE 'bangalore%' does not match 'Seattle%' in PostgreSQL. (2) Whitespace — the value has leading/trailing spaces that the pattern does not account for. (3) Incorrect wildcard placement — LIKE 'Seattle' without any % or _ is an exact match, not a pattern. The value must exactly equal 'Seattle' for it to match.

Fix: Use ILIKE instead of LIKE for case-insensitive matching. Wrap the column with TRIM() before LIKE: WHERE TRIM(city) LIKE 'Seattle%'. Verify wildcards are in the right position: LIKE '%Seattle%' matches the string anywhere; LIKE 'Seattle%' only matches strings starting with 'Seattle'. Test the pattern in isolation: SELECT 'Seattle' LIKE 'bangalore%' — this returns FALSE without ILIKE, confirming the case-sensitivity issue.

SUBSTRING returns unexpected result — wrong characters extracted

Cause: SUBSTRING in SQL is 1-indexed (the first character is at position 1), not 0-indexed like most programming languages. SUBSTRING('Hello', 0, 3) may return different results than expected — position 0 is before the first character. Off-by-one errors are common when developers from Python/JavaScript backgrounds expect 0-based indexing.

Fix: Remember SQL string positions are 1-indexed: SUBSTRING('Hello', 1, 3) returns 'Hel' (3 characters starting from position 1). SUBSTRING('Hello', 2, 3) returns 'ell' (3 characters starting from position 2). Test with a known string: SELECT SUBSTRING('Hello World', 7, 5) should return 'World'. Use POSITION to find the starting position dynamically: SUBSTRING(email, POSITION('@' IN email) + 1) correctly extracts everything after the @.

REPLACE only removes the first occurrence — other occurrences remain

Cause: This is a MySQL-specific behaviour in some contexts, or confusion with programming language string replace behaviour. In standard SQL (PostgreSQL, DuckDB), REPLACE replaces ALL occurrences by default. In some string libraries in application code, replace() only replaces the first occurrence unless a global flag is used.

Fix: In PostgreSQL and DuckDB, REPLACE always replaces all occurrences: REPLACE('aababab', 'ab', 'X') returns 'aXXX' — all three occurrences replaced. If you are seeing only the first occurrence replaced, verify you are using the SQL REPLACE function (not application-level string replace) and that you are running the query on the correct database. For REGEXP_REPLACE, use the 'g' flag for global replacement: REGEXP_REPLACE(text, pattern, replacement, 'g').

Try It Yourself

Write a query that generates a clean customer directory from FreshCart's customers table. Each row should show: a display_id (format: 'CUST-' followed by zero-padded 6-digit customer_id), clean_name (first + last name in Title Case, trimmed), username (everything before @ in email, lowercased), domain (everything after @ in email, lowercased), city_display (city in Title Case), a customer_slug (lowercase first name + '.' + lowercase last name + '-' + customer_id, all spaces replaced with hyphens), and a short_label (first initial + '. ' + last name, e.g. 'A. Khan'). Order by clean_name. Only include customers whose email contains an @ symbol.

🎯 Key Takeaways

✓String functions transform text at query time without modifying stored data — use them for display formatting, data cleaning, join key normalisation, and validation.
✓Always normalise case before comparing or joining string columns: LOWER(city) = LOWER(other_city). Use ILIKE (PostgreSQL/DuckDB) for case-insensitive LIKE matching.
✓TRIM removes leading and trailing whitespace. Always TRIM before comparing or joining on string columns — a trailing space makes 'Seattle' != 'Seattle ' causing silent join misses.
✓NULL propagates through string operations: 'Hello' || NULL = NULL. Wrap nullable columns with COALESCE before concatenation. CONCAT_WS skips NULLs automatically.
✓REPLACE substitutes substrings. TRANSLATE substitutes individual characters in one pass — use TRANSLATE for cleaning multiple different characters (phone number symbols) instead of chaining REPLACE.
✓SQL strings are 1-indexed: SUBSTRING(str, 1, 3) returns the first 3 characters. SPLIT_PART(str, delimiter, n) returns the nth segment (1-indexed).
✓LIKE uses % (any characters) and _ (one character). ILIKE is the case-insensitive variant. Applying LIKE/LOWER on indexed columns prevents index use — create functional indexes for frequent case-insensitive searches.
✓STRING_AGG(column, separator ORDER BY col) aggregates multiple rows into one concatenated string — essential for building comma-separated lists and human-readable summaries.
✓The standard cleaning chain: INITCAP(LOWER(TRIM(column))) produces consistently formatted Title Case from any input regardless of original capitalisation or whitespace.
✓Build cleaning pipelines as SELECT first, verify with spot checks on 20+ rows, then wrap in INSERT. Cleaning bugs caught in SELECT cost nothing. Bugs found after INSERT require DELETE + reclean + reinsert.

What comes next

In Module 42, you learn date and time functions — extracting parts, calculating differences, formatting, truncating to periods, and every temporal operation needed for time-series analytics and reporting.

Module 42 → Date and Time Functions

Discussion

Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.

Continue with GitHub