String Functions
Every text manipulation tool — concatenation, case, length, trimming, substrings, replacement, padding, splitting, and pattern matching for real data cleaning
// Part 01
Why String Functions Matter in Real Data
In an ideal world, every string column is clean, consistently formatted, and ready to query. In production, data arrives from mobile apps, web forms, CSV imports, partner APIs, and legacy systems — each with its own formatting quirks. Customer names with extra spaces, email addresses in mixed case, phone numbers formatted a dozen different ways, product codes with inconsistent separators.
String functions are the tools that standardise, clean, extract, and transform text data at query time — without modifying the underlying rows. They are essential for data quality checks, display formatting, join key normalisation (so 'bangalore' matches 'Seattle'), and extracting structured information from unstructured text fields.
// Part 02
Concatenation — Joining Strings Together
The || operator and CONCAT()
// Part 03
Case Functions — UPPER, LOWER, INITCAP
// Part 04
Length Functions — Measuring Strings
// Part 05
Trimming — Removing Whitespace and Characters
Leading and trailing whitespace is one of the most common data quality issues — especially in data imported from CSV files or form submissions. TRIM removes it cleanly. Untrimmed strings cause join mismatches and incorrect comparisons silently.
// Part 06
Substring Functions — Extracting Parts of Strings
SUBSTRING(str, start_pos, length) -- 1-indexedExtract a portion starting at start_pos for length characters. Negative start not supported in standard SQL.
LEFT(str, n)Returns the first n characters. LEFT('Seattle', 3) = 'Ban'
RIGHT(str, n)Returns the last n characters. RIGHT('ST001', 3) = '001'
SPLIT_PART(str, delimiter, n) -- 1-indexedSplits by delimiter and returns the Nth segment. SPLIT_PART('a,b,c', ',', 2) = 'b'
// Part 07
String Search — POSITION, STRPOS, CONTAINS
// Part 08
REPLACE and TRANSLATE — Substituting Characters
// Part 09
Padding — LPAD and RPAD
Padding adds characters to reach a target length — essential for generating fixed-width codes, formatting report columns for alignment, and generating zero-padded IDs.
// Part 10
LIKE and Pattern Matching — Flexible String Filtering
LIKE with wildcards is the standard SQL pattern matching tool. % matches any sequence of characters (including none). _ matches exactly one character. For more powerful patterns, SIMILAR TO (PostgreSQL) and regular expressions provide regex capabilities.
// Part 11
Regular Expressions — Powerful Pattern Matching
Regular expressions (regex) provide full pattern matching power — validating formats, extracting structured data from free text, and finding complex patterns. PostgreSQL uses ~ for regex match and ~* for case-insensitive match. DuckDB uses REGEXP_MATCHES.
// Part 12
Aggregating Strings — STRING_AGG and ARRAY_AGG
Standard aggregates collapse values to a single number. STRING_AGG collapses multiple string values into a single concatenated string — essential for building comma-separated lists, generating labels, and creating readable summaries.
// Part 13
Real-World Data Cleaning Pipeline
In practice, string functions are combined into cleaning pipelines — multiple transformations chained together to take raw dirty data and produce standardised output ready for joining, reporting, or loading into a clean table.
// Part 14
What This Looks Like at Work
You are a data engineer at Shopify. A partner has delivered a CSV of 50,000 seller records — names are in inconsistent case, phone numbers have various formats, city names have typos and whitespace, and emails are uncleaned. Before loading into the sellers table, you write a cleaning query that standardises every string field.
🎯 Pro Tip
Always create a cleaning SELECT before writing the INSERT. Run the SELECT, spot-check 20 rows manually, and verify edge cases (names with special characters, NULL values, extremely short or long values). Only wrap in INSERT once the SELECT output looks correct. Cleaning bugs caught in SELECT cost nothing. Cleaning bugs discovered after INSERT cost a DELETE, a fix, and a re-insert.
// Part 15
Interview Prep — 5 Questions With Complete Answers
Three approaches. First, normalise both sides using LOWER() or UPPER() before comparing: WHERE LOWER(city) = LOWER('bangalore'). This converts both values to the same case before the comparison, making it case-insensitive. Second, in PostgreSQL and DuckDB, use ILIKE instead of LIKE: WHERE city ILIKE 'bangalore' — ILIKE is the case-insensitive version of LIKE and supports wildcards. Third, for joins on string columns, normalise the join key: JOIN stores ON LOWER(TRIM(o.city)) = LOWER(TRIM(s.city)).
The most important practical application is join key normalisation. Two tables may store the same city as 'Seattle' and 'bangalore' — an INNER JOIN without normalisation would miss the match entirely. Using LOWER(TRIM(city)) on both sides ensures the join finds matches regardless of case or whitespace differences.
Performance consideration: applying LOWER() or TRIM() to a column in WHERE prevents the database from using an index on that column — because the index stores the original values, not the lowercased versions. For high-frequency case-insensitive searches on large tables, create a functional index: CREATE INDEX idx_customers_city_lower ON customers (LOWER(city)). Then WHERE LOWER(city) = 'bangalore' can use this index efficiently. Alternatively, enforce case normalisation at write time (store all city values in title case) so queries never need to apply LOWER().
REPLACE substitutes all occurrences of a specific substring with a replacement string: REPLACE(string, find_string, replace_string). It works on substrings of any length. REPLACE('Hello World', 'World', 'SQL') returns 'Hello SQL'. REPLACE can only handle one substitution pattern per call — to substitute multiple different patterns requires chaining multiple REPLACE calls.
TRANSLATE performs character-by-character substitution — each character in the FROM set is replaced by the corresponding character in the TO set: TRANSLATE(string, from_chars, to_chars). It processes the entire character-by-character mapping in a single pass. TRANSLATE('aeiou', 'aeiou', '12345') replaces a→1, e→2, i→3, o→4, u→5 in one call. If the TO set is shorter than the FROM set, characters with no corresponding TO character are deleted.
The practical choice: use REPLACE when substituting specific substrings or words — REPLACE(text, 'Ltd', 'Limited'). Use TRANSLATE when you need to substitute or remove many individual characters in one pass — TRANSLATE(phone, '()-+ ', '') removes five different characters in a single function call instead of five nested REPLACE calls. For phone number cleaning, TRANSLATE is significantly cleaner than REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(phone, '(', ''), ')', ''), '-', ''), '+', ''), ' ', '')).
Two approaches. Using SPLIT_PART: SPLIT_PART(email, '@', 2) splits by '@' and returns the second segment — the domain. For 'user@gmail.com' this returns 'gmail.com'. SPLIT_PART is the cleanest approach when the delimiter is a single consistent character.
Using SUBSTRING and POSITION: SUBSTRING(email, POSITION('@' IN email) + 1) starts extraction from one character after the '@' and takes everything to the end. POSITION('@' IN 'user@gmail.com') returns 5, so SUBSTRING starts from position 6, returning 'gmail.com'. This approach is more verbose but is portable across databases that may not support SPLIT_PART.
To further extract just the domain name without the TLD: SPLIT_PART(SPLIT_PART(email, '@', 2), '.', 1) extracts 'gmail' from 'gmail.com'. To validate that the email has an @ before extracting: WHERE POSITION('@' IN email) > 0 ensures only valid emails are processed. To handle NULLs: SPLIT_PART(COALESCE(email, ''), '@', 2) returns empty string for NULL emails rather than NULL. In production data cleaning, always validate the email format first with LIKE '%@%.%' or a regex before attempting extraction — malformed emails can cause unexpected results.
STRING_AGG is an aggregate function that concatenates string values from multiple rows into a single string, with a specified separator between each value. It is the PostgreSQL and DuckDB syntax. STRING_AGG(product_name, ', ' ORDER BY product_name) collects all product_name values in the group, sorts them alphabetically, and joins them with ', '. The result is one string per group — for a GROUP BY store_id, each store gets one string listing all its products.
GROUP_CONCAT is the MySQL equivalent: GROUP_CONCAT(product_name ORDER BY product_name SEPARATOR ', '). The functionality is identical — both aggregate string values from multiple rows into one concatenated string with a separator. The syntax differs: STRING_AGG uses standard SQL aggregate function syntax with the separator as the second argument; GROUP_CONCAT uses MySQL's non-standard keyword syntax with SEPARATOR.
Important differences: STRING_AGG returns NULL when there are no rows in the group (consistent with SUM/AVG behaviour). GROUP_CONCAT returns NULL in the same case. STRING_AGG in PostgreSQL has a strict type requirement — the column must be text or castable to text. STRING_AGG supports ORDER BY inside the function to control the concatenation order. Both support DISTINCT to deduplicate values before aggregating: STRING_AGG(DISTINCT category, ', ') produces a deduplicated list. Both have length limits (PostgreSQL default 1GB; MySQL default 1024 bytes configurable with group_concat_max_len). For large result sets, consider whether aggregating to a string is the right approach or whether a separate rows-based result would be cleaner.
NULL propagates through most string operations — any expression involving NULL produces NULL. 'Hello' || NULL || 'World' = NULL (not 'HelloWorld'). LENGTH(NULL) = NULL. LOWER(NULL) = NULL. TRIM(NULL) = NULL. This is the correct SQL behaviour but it can cause surprising results when any input column contains NULLs.
COALESCE is the primary tool for handling NULLs in string operations: COALESCE(column, '') replaces NULL with an empty string, allowing concatenation to proceed. first_name || ' ' || COALESCE(middle_name, '') || ' ' || last_name builds a full name even when middle_name is NULL. COALESCE(city, 'Unknown') provides a display label for NULL city values.
CONCAT() in MySQL treats NULL as empty string rather than propagating it — CONCAT('Hello', NULL, 'World') = 'HelloWorld' in MySQL, unlike PostgreSQL's || which would return NULL. This makes CONCAT() safer for nullable columns in MySQL but masks potential data issues. CONCAT_WS (Concatenate With Separator) also skips NULL values entirely — CONCAT_WS(', ', city, state, country) with a NULL city produces 'state, country' rather than ', state, country'. For regex and pattern matching functions, NULL inputs always return NULL — wrap nullable columns with COALESCE before passing to REGEXP_REPLACE or REGEXP_EXTRACT. The consistent rule: use COALESCE to substitute a meaningful default (empty string, 'Unknown', 'N/A') before any string operation on a nullable column.
// Part 16
Errors You Will Hit — And Exactly Why They Happen
🎯 Key Takeaways
- ✓String functions transform text at query time without modifying stored data — use them for display formatting, data cleaning, join key normalisation, and validation.
- ✓Always normalise case before comparing or joining string columns: LOWER(city) = LOWER(other_city). Use ILIKE (PostgreSQL/DuckDB) for case-insensitive LIKE matching.
- ✓TRIM removes leading and trailing whitespace. Always TRIM before comparing or joining on string columns — a trailing space makes 'Seattle' != 'Seattle ' causing silent join misses.
- ✓NULL propagates through string operations: 'Hello' || NULL = NULL. Wrap nullable columns with COALESCE before concatenation. CONCAT_WS skips NULLs automatically.
- ✓REPLACE substitutes substrings. TRANSLATE substitutes individual characters in one pass — use TRANSLATE for cleaning multiple different characters (phone number symbols) instead of chaining REPLACE.
- ✓SQL strings are 1-indexed: SUBSTRING(str, 1, 3) returns the first 3 characters. SPLIT_PART(str, delimiter, n) returns the nth segment (1-indexed).
- ✓LIKE uses % (any characters) and _ (one character). ILIKE is the case-insensitive variant. Applying LIKE/LOWER on indexed columns prevents index use — create functional indexes for frequent case-insensitive searches.
- ✓STRING_AGG(column, separator ORDER BY col) aggregates multiple rows into one concatenated string — essential for building comma-separated lists and human-readable summaries.
- ✓The standard cleaning chain: INITCAP(LOWER(TRIM(column))) produces consistently formatted Title Case from any input regardless of original capitalisation or whitespace.
- ✓Build cleaning pipelines as SELECT first, verify with spot checks on 20+ rows, then wrap in INSERT. Cleaning bugs caught in SELECT cost nothing. Bugs found after INSERT require DELETE + reclean + reinsert.
What comes next
In Module 42, you learn date and time functions — extracting parts, calculating differences, formatting, truncating to periods, and every temporal operation needed for time-series analytics and reporting.
Module 42 → Date and Time FunctionsDiscussion
0Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.