Data Warehouse Concepts
Columnar storage, query execution, result caching, virtual warehouses, cluster keys, and why Snowflake queries are fast.
Understanding the Warehouse From the Inside
Most data engineers write SQL against Snowflake or BigQuery without understanding how those systems actually execute the query. This gap produces pipelines with unexplained performance problems, unnecessarily expensive queries, and architectural decisions that make no sense once you understand the internals.
This module covers the concepts that explain warehouse behaviour: why columnar storage makes analytical queries fast, how a virtual warehouse scales, what micro-partition pruning and result caching actually do, why distribution keys matter in Redshift, and how to read a query plan to find the bottleneck. These are the concepts that turn a data engineer who writes SQL into one who understands what the warehouse is doing with it.
Columnar Storage — The Physical Layout That Makes Analytics Fast
The single biggest reason modern analytical databases are fast for analytical workloads is columnar storage. Every analytical query that reads 3 columns out of a 200-column table benefits enormously from storing data by column rather than by row. Understanding why requires understanding how data is physically laid out on disk and what the CPU and I/O costs of reading it are.
Row storage vs column storage — the physical difference
SAMPLE DATA (orders table, 5 rows, 6 columns):
order_id customer_id store_id amount status created_at
─────────────────────────────────────────────────────────────────
9284751 4201938 ST001 380.00 delivered 2026-03-17
9284752 4201939 ST002 245.00 cancelled 2026-03-17
9284753 4201940 ST001 890.00 delivered 2026-03-17
9284754 4201941 ST003 125.00 placed 2026-03-17
9284755 4201938 ST001 460.00 confirmed 2026-03-17
ROW STORAGE (PostgreSQL, MySQL, OLTP databases):
Disk layout — all columns of one row stored together:
[9284751][4201938][ST001][380.00][delivered][2026-03-17]
[9284752][4201939][ST002][245.00][cancelled][2026-03-17]
[9284753][4201940][ST001][890.00][delivered][2026-03-17]
...
QUERY: SELECT SUM(amount) FROM orders WHERE status = 'delivered'
Must READ all 6 columns for all 5 rows to compute sum of 1 column.
For 500M rows × 6 columns × 8 bytes avg = 24 GB read from disk.
Most data read is thrown away (5 columns not needed, 2 rows not matching).
COLUMN STORAGE (Snowflake, BigQuery, Redshift, Parquet):
Disk layout — all values of one column stored together:
amount column: [380.00][245.00][890.00][125.00][460.00]
status column: [delivered][cancelled][delivered][placed][confirmed]
customer_id col: [4201938][4201939][4201940][4201941][4201938]
...
QUERY: SELECT SUM(amount) FROM orders WHERE status = 'delivered'
Step 1: Read status column → find rows 1, 3 (delivered)
Step 2: Read amount column → only rows 1, 3: [380.00][890.00]
Step 3: Compute SUM = 1270.00
For 500M rows × 2 columns × 8 bytes avg = 8 GB read (vs 24 GB)
→ 3× less I/O for a 1-of-6-column query
MORE DRAMATIC EXAMPLE:
Table has 200 columns. Query uses 3 columns.
Row storage: reads all 200 columns = 1.6 TB
Column storage: reads 3 columns = 24 GB
→ 66× less I/O
WHY COLUMN STORAGE ALSO COMPRESSES BETTER:
Columns store the same data type repeatedly.
status column: [delivered][delivered][delivered][cancelled][cancelled]
Adjacent same values → run-length encoding compresses 5 values to 2.
Row storage: [delivered, 380.00, ST001] — mixed types, poor compression.
Column storage: all amounts together, all statuses together — high compression.
Real compression ratios on typical analytical data:
Row storage (PostgreSQL): 1.5-2× compression
Column storage (Parquet/Snowflake): 5-15× compression
→ 5-10× less storage = 5-10× cheaper on S3/SnowflakeCompression encoding per column type
COMPRESSION ENCODING TYPES — chosen automatically by the warehouse:
RUN-LENGTH ENCODING (RLE):
Best for: low-cardinality columns with runs of identical values
Example: status column sorted by status:
CANCELLED × 8,432, CONFIRMED × 12,847, DELIVERED × 319,284
Storage: [("CANCELLED", 8432), ("CONFIRMED", 12847), ("DELIVERED", 319284)]
Instead of storing 340,563 individual strings → store 3 pairs
Compression ratio: 100,000:1 for very low cardinality sorted data
DICTIONARY ENCODING:
Best for: categorical columns with moderate cardinality
Example: status has 5 distinct values out of 500M rows
Dictionary: {0: 'placed', 1: 'confirmed', 2: 'delivered', 3: 'cancelled', 4: 'refunded'}
Column stored as: [2, 0, 1, 3, 2, 2, ...] (1-byte integers not strings)
vs storing 'delivered' × 500M = 4.5 GB
→ Stored as: 500M × 1 byte = 500 MB (9× compression)
Queries on status column compare integers, not strings → faster
DELTA ENCODING:
Best for: monotonically increasing numeric columns (IDs, timestamps)
Example: order_id column:
[9284751, 9284752, 9284753, 9284754, 9284755]
Instead of: store each 8-byte integer
Delta encode: [9284751, +1, +1, +1, +1]
The base + small deltas compress further with RLE
Created_at timestamps: store base + deltas in seconds (small ints)
BIT PACKING:
Best for: small integer columns
order_hour (0-23): requires only 5 bits, not 32/64 bits
is_promo (0 or 1): 1 bit
SNOWFLAKE AUTOMATIC ENCODING:
Snowflake selects optimal encoding per column automatically.
No manual configuration needed.
During COPY INTO / INSERT: Snowflake samples data to determine encoding.
You can view the chosen encoding:
SELECT column_name, data_type, compression
FROM information_schema.columns
WHERE table_name = 'ORDERS';
PRACTICAL IMPLICATION FOR DATA ENGINEERS:
Choose data types correctly — they determine compression effectiveness.
VARCHAR(100) for a status with 5 short values: poor compression signal.
VARCHAR(20) for status: tells the warehouse the values are short.
Numeric types for amounts (not VARCHAR): enables numeric encoding.
TIMESTAMPTZ for timestamps: enables delta encoding.
BOOLEAN for flags: bit packing.
Never store numbers as strings — loses all numeric compression and
query-time numeric comparison performance.Snowflake Internals — Micro-Partitions, Pruning, and Caching
Snowflake is the most widely used cloud data warehouse in 2026. Its architecture differs fundamentally from traditional warehouses — understanding its three-layer design (storage, query processing, cloud services) explains both why it is fast and where it can be slow.
Snowflake's three-layer architecture
SNOWFLAKE THREE-LAYER ARCHITECTURE:
LAYER 1: CLOUD SERVICES LAYER (always running)
─────────────────────────────────────────────────────────────────────
• Query parser and optimizer
• Metadata repository (table stats, partition info, access control)
• Transaction management
• Result cache (query results stored here, not in compute)
• Runs 24/7 — always available even when warehouses are suspended
• No per-query cost — included in Snowflake's base fee
• Single global metadata store for all virtual warehouses
LAYER 2: QUERY PROCESSING LAYER (virtual warehouses)
─────────────────────────────────────────────────────────────────────
• Virtual warehouse: a cluster of EC2 nodes (Snowflake manages them)
• Each warehouse is independent — no resource sharing between warehouses
• Local SSD disk cache (data reads from S3 are cached here)
• Scales up (larger node type) or out (multi-cluster) independently
• Auto-suspend after idle time (no compute cost when idle)
• Auto-resume on query (takes 1-5 seconds)
LAYER 3: STORAGE LAYER (micro-partitions on S3)
─────────────────────────────────────────────────────────────────────
• All table data stored as encrypted Parquet files in Snowflake-managed S3
• Organised into micro-partitions (50-500 MB compressed)
• Each micro-partition has column-level metadata (min, max, count, null%)
• Metadata stored in cloud services layer — never requires data read
• Immutable: new writes create new micro-partitions, old ones marked deleted
QUERY EXECUTION PATH:
1. Query submitted → Cloud Services parses and optimises
2. Cloud Services checks RESULT CACHE
→ If cache hit: return cached result instantly (no warehouse needed)
3. Cloud Services evaluates METADATA to prune micro-partitions
→ "WHERE date = '2026-03-17'" — prune all partitions without March 17
4. Virtual warehouse reads only surviving micro-partitions from S3
→ SSD cache hit: data already on local disk from previous query
→ S3 read: fetch from storage (slower)
5. Warehouse executes operators (filter, join, aggregate)
6. Result returned, written to result cache
COST BREAKDOWN:
Idle warehouse (auto-suspended): $0/hour compute
Active warehouse: $2-$16/credit × warehouse credits/hour
Storage: ~$40/TB/month (compressed)
Cloud services queries (metadata only): included in storage costMicro-partitions — Snowflake's storage unit
MICRO-PARTITION PROPERTIES:
Size: 50–500 MB compressed (16–512 MB uncompressed)
Organisation: data written to micro-partitions in arrival order
(not sorted unless CLUSTER BY is configured)
Metadata: each micro-partition has column-level statistics:
- min value per column
- max value per column
- null count per column
- distinct count (approximate) per column
MICRO-PARTITION PRUNING:
When a query has a WHERE clause, Snowflake compares it to micro-partition
metadata BEFORE reading any data. Partitions that cannot contain matching
rows are skipped entirely.
Example: SELECT * FROM orders WHERE order_date = '2026-03-17'
Micro-partition A: min_order_date=2026-03-15, max_order_date=2026-03-16
→ PRUNE: max is before 2026-03-17, cannot contain matching rows
Micro-partition B: min_order_date=2026-03-16, max_order_date=2026-03-18
→ KEEP: range includes 2026-03-17
Micro-partition C: min_order_date=2026-03-18, max_order_date=2026-03-20
→ PRUNE: min is after 2026-03-17
If 99% of micro-partitions are pruned: 1% of data read — dramatic speedup.
If data is randomly distributed: no pruning possible, full scan.
WHEN PRUNING WORKS WELL:
✓ Querying by order_date when data was loaded daily (natural ordering)
✓ Querying recent data when recent = last rows loaded
✓ Range filters on monotonically increasing columns (IDs, timestamps)
WHEN PRUNING DOES NOT WORK:
✗ Querying by customer_id (random ordering — no correlation with load order)
✗ Querying by status (random distribution across all partitions)
✗ Applying functions to columns: WHERE YEAR(order_date) = 2026
→ Snowflake cannot use date metadata for YEAR() function
→ Use: WHERE order_date >= '2026-01-01' AND order_date < '2027-01-01'
CHECKING PRUNING EFFECTIVENESS:
After running a query, check the query profile in Snowflake UI:
"Partitions scanned" vs "Partitions total"
Good: 100 of 10,000 partitions scanned (1%)
Bad: 9,800 of 10,000 partitions scanned (98%) → consider CLUSTER BYCluster keys — forcing good pruning on any column
PROBLEM: orders table loaded daily, 500M rows, 10,000 micro-partitions
natural order = order in which rows arrived (mostly date-ordered by luck)
Queries by order_date: good pruning (data is mostly date-ordered)
Queries by store_id: poor pruning (stores randomly distributed)
Query: SELECT * FROM orders WHERE store_id = 'ST001'
Without CLUSTER BY:
ST001 rows spread randomly across all 10,000 partitions
Snowflake must scan all 10,000 partitions
Partitions scanned: 10,000 of 10,000 (100%) — full table scan
SOLUTION: ALTER TABLE silver.orders CLUSTER BY (store_id, order_date);
Snowflake runs a background reclustering process:
Reads existing micro-partitions
Re-sorts data by (store_id, order_date)
Writes new micro-partitions where each contains similar store_id values
After reclustering:
ST001 rows co-located in same micro-partitions
Partitions scanned for store_id='ST001': ~100 of 10,000 (1%)
CLUSTER KEY SELECTION RULES:
✓ Columns most frequently used in WHERE clauses
✓ High-cardinality columns where equality or range queries are common
✓ Typically: (most_common_filter, second_most_common_filter)
✓ For event tables: (tenant_id, event_date) or (store_id, order_date)
✗ Don't cluster by a timestamp to microsecond precision (too many distinct values)
✗ Don't cluster by a boolean (only 2 values — poor co-location)
✗ Don't cluster small tables (< 1 TB) — overhead exceeds benefit
CLUSTER KEY COST:
Reclustering is a background compute operation — has a cost.
Monitor via: SELECT * FROM snowflake.account_usage.automatic_clustering_history;
For high-write tables: disable automatic clustering, run manual clustering
during maintenance windows to control cost:
ALTER TABLE silver.orders SUSPEND RECLUSTER;
-- Trigger manually:
ALTER TABLE silver.orders RESUME RECLUSTER;
CHECKING CLUSTER DEPTH:
SELECT SYSTEM$CLUSTERING_INFORMATION('silver.orders', '(store_id, order_date)');
Returns: average_depth (lower = better co-location, 1.0 = perfect)
Good: average_depth < 2
Needs reclustering: average_depth > 5Caching Layers — Why Some Queries Are Instantaneous
Snowflake has three distinct caching layers, each with different scope and lifetime. Understanding them explains why some queries return in milliseconds and others take minutes — and why the same query run twice in quick succession may have wildly different execution times.
CACHE LAYER 1: RESULT CACHE (cloud services layer)
─────────────────────────────────────────────────────────────────────
What: The complete result set of a query
Where: Cloud services layer (not in the virtual warehouse)
Duration: 24 hours (reset to 24h on each access of the same result)
Scope: Per user per query — exact SQL match required
Cost: Zero compute — result cache queries do not use a warehouse
WHEN RESULT CACHE HITS:
- Same exact SQL text (including whitespace and case)
- Same session parameters
- Underlying data has NOT changed since the result was cached
- User has the same access permissions
INVALIDATION:
- Data in any referenced table changes (DML or new micro-partitions)
- 24-hour expiry without access
- Table structure changes (ALTER TABLE)
CHECKING IF RESULT CACHE WAS USED:
In query history: look for "result cache" in execution details
Or: SELECT query_type, total_elapsed_time, result_from_cache
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%SELECT%' LIMIT 20;
PRACTICAL IMPACT:
Daily analytics queries that run the same SELECT at 8 AM and 9 AM:
8 AM run: 45 seconds (full execution)
9 AM run: < 1 second (result cache hit) — if no new data was loaded
If Silver pipeline ran at 8:30 AM: 9 AM run = 45 seconds (cache invalidated)
CACHE LAYER 2: LOCAL DISK CACHE (virtual warehouse)
─────────────────────────────────────────────────────────────────────
What: Raw micro-partition data read from S3
Where: SSD disk on virtual warehouse nodes
Duration: Until the warehouse is suspended (cache evicted on suspend)
Scope: Per virtual warehouse (all users on same warehouse benefit)
HOW IT WORKS:
First query reads micro-partitions from S3 → stores on local SSD
Subsequent queries needing same partitions: reads from SSD (10× faster)
Cache is managed with LRU (least recently used) eviction
IMPLICATION: do not suspend-and-resume too aggressively
Auto-suspend after 10 minutes: cache lost every 10 minutes
Cold queries take longer after resume (S3 reads, not SSD reads)
Balance: suspend when genuinely idle, but not after every single query
CACHE LAYER 3: METADATA CACHE (cloud services layer)
─────────────────────────────────────────────────────────────────────
What: Table statistics and micro-partition metadata
Where: Cloud services layer (persistent)
Duration: Always maintained — updated on every write
Cost: Included in storage fees
WHAT IT ENABLES:
SHOW TABLES, COUNT(*), MAX(id): answered from metadata alone
No warehouse needed for:
SELECT COUNT(*) FROM large_table → returns instantly from metadata
SELECT MAX(updated_at) FROM silver.orders → metadata-only query
These queries appear in query history with result_from_cache=TRUE
and total_elapsed_time < 100ms regardless of table size
ARCHITECT YOUR QUERIES AROUND CACHING:
BI tool scheduled refreshes: refresh at fixed intervals AFTER pipeline runs
not before — otherwise analysts get cached stale results
Warm-up queries: some teams run a "warm-up" script after each pipeline run
to pre-populate the disk cache for common dashboard queries
Avoid cache busting: parameterised queries with user-specific WHERE clauses
never hit the result cache — each user's variation is a different SQL textVirtual Warehouses — Sizing, Scaling, and Cost Control
A virtual warehouse is Snowflake's unit of compute — a named cluster of CPU and memory that executes queries. Getting virtual warehouse configuration wrong is the most common cause of both performance problems and unexpectedly high Snowflake bills.
Warehouse sizes and what they mean
SNOWFLAKE WAREHOUSE SIZES:
Size Credits/Hour Nodes Local Cache Best for
─────────────────────────────────────────────────────────────────────
X-Small 1 1 ~70 GB Light queries, small tables, dev
Small 2 2 ~140 GB Moderate queries, staging
Medium 4 4 ~280 GB dbt production runs, typical analytics
Large 8 8 ~560 GB Large dbt jobs, complex SQL
X-Large 16 16 ~1.1 TB Very large joins, ML feature extraction
2X-Large 32 32 ~2.2 TB Batch processing, largest datasets
3X-Large 64 64 ~4.4 TB Extreme scale (rare — very expensive)
4X-Large 128 128 ~8.8 TB Almost never needed
Credit cost (US East, 2026 approximate): $2–$4/credit/hour depending on plan
X-Small running 1 hour: 1 credit × $2 = $2
2X-Large running 1 hour: 32 credits × $2 = $64
SCALE UP (larger warehouse): single-query performance
A complex dbt model that spills to disk on Medium: run it on Large
A join between two 10B-row tables: needs X-Large to fit in memory
Useful for: single-job batch transformations, ML feature computation
SCALE OUT (multi-cluster warehouse): concurrency
Multiple small clusters serving many users simultaneously
Each cluster gets its own set of nodes
Incoming queries routed to least-busy cluster
Useful for: analyst-heavy workloads, BI tool dashboard loads
ALTER WAREHOUSE ANALYTICS_WH SET
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 5
SCALING_POLICY = 'ECONOMY'; -- or 'STANDARD'
ECONOMY: scale out only when existing clusters are fully loaded
STANDARD: scale out proactively when queue starts building
AUTO-SUSPEND AND AUTO-RESUME:
Auto-suspend: warehouse turns off after N minutes idle (no compute cost)
Auto-resume: first query after suspend wakes the warehouse (1-5 second delay)
RECOMMENDED SETTINGS BY WAREHOUSE PURPOSE:
dbt transformation warehouse: AUTO_SUSPEND = 300 (5 min)
Analyst self-service: AUTO_SUSPEND = 600 (10 min, keep warm for users)
Dashboard service account: AUTO_SUSPEND = 60 (1 min, cache warms fast)
ML training warehouse: AUTO_SUSPEND = 60 (rare large jobs)
Dev/test warehouse: AUTO_SUSPEND = 60 (minimise idle cost)
WAREHOUSE SIZING DECISION PROCESS:
1. Run the query on Medium first
2. Check query profile: is it spilling to disk?
→ Yes: memory is insufficient → size up to Large or X-Large
3. Check partitions scanned: is it scanning all partitions?
→ Yes: cluster keys or query filter issue — sizing up won't help
4. Check bytes scanned per second: is CPU the bottleneck?
→ Yes, and no spill: more nodes (scale out) may help for parallelism
5. Check queue time: are queries waiting for the warehouse?
→ Yes: scale out (multi-cluster) or separate workloads into dedicated warehousesWarehouse isolation — separating workloads
# ANTI-PATTERN: one warehouse for everything
# All users, all pipelines, all dashboards share one COMPUTE_WH
# dbt job runs 30-minute transformation → blocks analyst queries for 30 min
# Analyst runs expensive ad-hoc query → slows down dashboard for 5 min
# Result: unpredictable performance, no cost attribution, no isolation
# CORRECT PATTERN: dedicated warehouses per workload type
# Each has independent sizing, auto-suspend, and scaling policy
-- Create warehouses for different workload types:
CREATE WAREHOUSE dbt_pipeline_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE
COMMENT = 'dbt production pipeline runs — pipelines use this';
CREATE WAREHOUSE analyst_wh
WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 600
AUTO_RESUME = TRUE
MAX_CLUSTER_COUNT = 3 -- scale out for concurrent analyst queries
COMMENT = 'Analyst self-service queries and BI tools';
CREATE WAREHOUSE dashboard_wh
WAREHOUSE_SIZE = 'X-SMALL'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
COMMENT = 'Metabase/Tableau service account — small, frequently-used queries';
CREATE WAREHOUSE ml_wh
WAREHOUSE_SIZE = 'X-LARGE'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
COMMENT = 'ML feature extraction — rare, large, terminated after use';
-- Assign warehouses via roles:
GRANT USAGE ON WAREHOUSE dbt_pipeline_wh TO ROLE pipeline_role;
GRANT USAGE ON WAREHOUSE analyst_wh TO ROLE analyst_role;
GRANT USAGE ON WAREHOUSE dashboard_wh TO ROLE metabase_service_role;
-- Cost attribution: each warehouse's credit usage is separately trackable
SELECT warehouse_name, SUM(credits_used) AS total_credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_DATE)
GROUP BY 1
ORDER BY 2 DESC;
-- Shows per-team compute cost clearly — enables chargeback to business unitsSnowflake vs BigQuery vs Redshift — Architectural Differences
These three warehouses dominate the market and appear in almost every data engineering job. They are architecturally different in ways that matter for performance tuning, cost management, and platform design.
| Dimension | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Storage model | Managed S3 (micro-partitions) | Capacitor columnar format on Colossus | Local attached storage per node (RA3: S3-backed) |
| Compute model | Named virtual warehouses — you size them | Serverless — you set slot reservations or on-demand | Named clusters — you provision node types and count |
| Scaling | Scale up/down warehouse size; multi-cluster for concurrency | Automatic — serverless, scales to terabytes per query | Manual cluster resize; Concurrency Scaling for bursts |
| Cost model | Pay per second of warehouse time + storage | On-demand: $6.25/TB scanned; or flat reservation slots | Pay per node-hour (provisioned) or per query (Serverless) |
| Performance tuning | Cluster keys, warehouse sizing, result cache management | Partition and cluster tables, avoid SELECT *, avoid cross-joins | Distribution keys, sort keys, VACUUM, ANALYZE, WLM queues |
| Concurrency | Multi-cluster warehouse for concurrent users | Unlimited — serverless handles concurrency automatically | WLM queues + Concurrency Scaling for bursts |
| JSON / semi-structured | VARIANT type — native JSON with dot notation access | JSON/RECORD types — nested and repeated fields | SUPER type (Redshift PartiQL) — less mature |
| Data sharing | Native — share live data across accounts without copying | Analytics Hub — publish/subscribe data exchange | Data sharing across clusters in same region |
| Cloud native | Multi-cloud (AWS, Azure, GCP) — same experience | Google Cloud native | AWS native (deep integration with S3, Glue, SageMaker) |
| Best for (2026) | Multi-cloud, SQL-heavy platforms, strong data sharing | Google Cloud, serverless simplicity, petabyte queries | AWS-native stack, Redshift-native ML, tight S3 integration |
Redshift distribution keys and sort keys — the manual tuning that matters
# Redshift distributes data across multiple nodes.
# How data is distributed determines join performance.
# DISTRIBUTION STYLES:
# EVEN: rows distributed round-robin across nodes — good default for small tables
# KEY: rows with same distribution key value go to same node — good for joins
# ALL: entire table copied to every node — for small dimension tables
# AUTO: Redshift decides based on table size (modern default)
-- Tables joined on order_id should share the same distribution key:
CREATE TABLE silver.orders (
order_id BIGINT NOT NULL,
customer_id BIGINT,
amount DECIMAL(10,2),
...
) DISTKEY(order_id) SORTKEY(order_date);
-- order_id is the join column → co-locate orders and order_items on same node
-- order_date is the most common filter → sort key enables range pruning
CREATE TABLE silver.order_items (
order_id BIGINT,
product_id BIGINT,
quantity INTEGER,
...
) DISTKEY(order_id); -- same DISTKEY as orders → join without data movement
# WITHOUT MATCHING DISTKEYS:
# JOIN between orders (node 1,2,3) and order_items (node 4,5,6):
# Redshift must REDISTRIBUTE one table across the network before joining
# This is the most expensive join operation — 10-100× slower than co-located
# CHECKING QUERY DISTRIBUTION:
# In EXPLAIN output, look for:
# DS_BCAST_INNER: broadcasting inner table to all nodes (expensive for large tables)
# DS_DIST_BOTH: redistributing both tables (most expensive)
# DS_DIST_NONE: co-located join (fastest — what you want)
# SORT KEYS:
# COMPOUND SORTKEY(col1, col2): efficient for filters on col1 or (col1, col2)
# INTERLEAVED SORTKEY(col1, col2): equal efficiency for any subset — but VACUUM slower
# For time-series data: SORTKEY(ingestion_date, event_date) — match your query patterns
# VACUUM and ANALYZE:
# Redshift does not auto-update table statistics — must run manually:
VACUUM silver.orders; -- reclaim deleted space, re-sort unsorted region
ANALYZE silver.orders; -- update table statistics for query optimizer
-- Run after large loads. Without current statistics: query planner makes poor choices.Reading Query Plans — Finding the Bottleneck
A query plan is the execution strategy the database chose for your SQL. Reading it tells you exactly what the warehouse is doing — which operations take the most time, whether indexes or cluster keys are being used, and where the bottleneck is. Every data engineer should be able to read a basic query plan.
Snowflake query profile — what to look for
SNOWFLAKE QUERY PROFILE (accessed via UI after query runs):
Shows a node-by-node execution graph with time and data statistics.
OPERATOR TYPES YOU WILL SEE:
TableScan: reading micro-partitions from S3 or local cache
Filter: applying WHERE conditions
Join: HASH JOIN, MERGE JOIN, or NESTED LOOP
Aggregate: GROUP BY computation
Sort: ORDER BY
WindowFunction: OVER (PARTITION BY ... ORDER BY ...)
Result: sending results to client
Exchange: redistributing data between nodes for parallel ops
KEY METRICS PER OPERATOR:
Time (ms): how long this operator took
Rows: how many rows this operator processed
Bytes: how many bytes read/written
Partitions scanned: (TableScan only) vs total partitions
BOTTLENECK PATTERNS AND FIXES:
1. TableScan reads most partitions:
Metric: "Partitions scanned: 9,800 / 10,000"
Cause: No pruning — filter column not in cluster key
Fix: ALTER TABLE CLUSTER BY (filter_column)
Or: check that filter uses column directly, not inside function
2. Large HASH JOIN with spilling:
Metric: "Bytes written to local storage" > 0 (spill to disk)
Cause: Join tables too large for memory at this warehouse size
Fix: Size up the warehouse OR filter before joining to reduce size
Or: ensure smaller table is the build side of the hash join
3. Slow Aggregate:
Metric: Aggregate node takes most of query time
Cause: GROUP BY on many distinct values, many output rows
Fix: Pre-aggregate in Silver/Gold dbt model if this runs frequently
Push down filters to reduce rows before aggregation
4. WindowFunction with many rows:
Metric: WindowFunction node is slow, large data volume
Cause: OVER (PARTITION BY customer_id) when customer_id is unique
→ window function has to keep all rows in memory
Fix: Ensure PARTITION BY is on a selective column that groups rows
PRACTICAL USAGE:
After any query that is slower than expected:
1. Click "Query ID" in Snowflake UI → "Query Profile"
2. Find the slowest node (highest % of total time)
3. Read the metrics on that node to diagnose the cause
4. Apply the appropriate fix above
Most query performance issues fall into one of three categories:
partition pruning not working, join order wrong, or warehouse too small.EXPLAIN in PostgreSQL — reading row estimates and scan types
EXPLAIN ANALYZE
SELECT o.order_id, c.name, o.amount
FROM silver.orders o
JOIN silver.customers c ON o.customer_id = c.customer_id
WHERE o.order_date = '2026-03-17';
-- Sample output:
-- Hash Join (cost=12847.00..98432.00 rows=48234 width=28)
-- (actual time=124ms..892ms rows=48109 loops=1)
-- Hash Cond: (o.customer_id = c.customer_id)
-- -> Index Scan using idx_orders_date on orders o
-- (cost=0.44..48291.00 rows=48234 width=20)
-- (actual time=0.04ms..124ms rows=48234 loops=1)
-- Index Cond: (order_date = '2026-03-17'::date)
-- -> Hash (cost=7823.00..7823.00 rows=812345 width=16)
-- (actual time=98ms..98ms rows=812345 loops=1)
-- -> Seq Scan on customers c
-- (cost=0.00..7823.00 rows=812345 width=16)
-- (actual time=0.03ms..54ms rows=812345 loops=1)
-- Planning Time: 1.2 ms
-- Execution Time: 897 ms
-- WHAT TO READ:
-- Index Scan (vs Seq Scan): orders is using an index on order_date ✓
-- → index was used for the WHERE filter — efficient
-- Seq Scan on customers: full table scan — no index used
-- → 812,345 rows scanned just to build hash table — expected for large dim table
-- Hash Join: one table (customers) loaded into hash table, other (orders) probed
-- → orders is the probe side (smaller matching set) — correct order ✓
-- actual time vs cost: actual rows 48,109 vs estimated 48,234 — close ✓
-- if actual >> estimated: stale statistics, run ANALYZE
-- Execution Time 897ms: reasonable for 48k rows joining to 812k rows
-- RED FLAGS IN EXPLAIN OUTPUT:
-- Seq Scan on large table with filter → missing index
-- Hash Join with very different estimated vs actual rows → stale stats → ANALYZE
-- Nested Loop with large outer table → wrong join type for this size → HashJoin needed
-- Sort on large data set without index → ORDER BY without index → add index or avoidCutting a Snowflake Bill by 60% Without Losing Performance
FreshMart's Snowflake bill grew from $4,200/month to $11,800/month over three months. The data team is asked to investigate and reduce the cost. Here is the systematic approach.
-- STEP 1: Find the top credit consumers
SELECT warehouse_name,
SUM(credits_used) AS total_credits,
SUM(credits_used) * 2 AS approx_usd_cost,
COUNT(*) AS query_count
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('month', -3, CURRENT_DATE)
GROUP BY 1
ORDER BY 2 DESC;
-- Results:
-- COMPUTE_WH 4,890 credits $9,780 182,433 queries ← 83% of bill
-- DBT_PIPELINE_WH 420 credits $840 284 queries
-- ML_WH 80 credits $160 8 queries
-- COMPUTE_WH is the problem — one warehouse doing everything.
-- STEP 2: Find the most expensive queries on COMPUTE_WH
SELECT query_text,
total_elapsed_time / 1000 / 60 AS duration_min,
credits_used_cloud_services AS cloud_credits,
partitions_scanned,
partitions_total,
bytes_scanned / 1024 / 1024 / 1024 AS gb_scanned
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'COMPUTE_WH'
AND start_time >= DATEADD('month', -1, CURRENT_DATE)
ORDER BY credits_used_cloud_services DESC
LIMIT 20;
-- Top 3 queries explain 60% of the cost:
-- 1. dbt gold.customer_ltv model: 28 min, 892 partitions/892 total (100% scan)
-- 2. Metabase dashboard refresh: 12 min, runs every 5 min (12× per hour)
-- 3. Analyst's ad-hoc explore: 45 min, ran 8 times today
-- STEP 3: Fix the top issues
-- Issue 1: dbt Gold model scanning all partitions
-- Check query profile → TableScan shows no pruning on customers table
-- customers table has no CLUSTER BY
ALTER TABLE silver.customers CLUSTER BY (tier, city);
-- After reclustering: customer_ltv model scans 8% of partitions → 28min → 4min
-- Issue 2: Metabase dashboard refreshing every 5 minutes
-- Gold metrics don't change more than once per hour (Silver pipeline is hourly)
-- Fix: change Metabase refresh to 60 minutes
-- Additionally: Metabase now hits result cache on most refreshes → near-instant
-- Issue 3: Analyst using X-Large warehouse for small queries
-- Query profile shows: 95% idle, 5% actual work
-- Analyst's connection string was set to use compute_wh (X-Large by mistake)
-- Fix: reassign analyst to analyst_wh (Small), add query timeout 10 minutes
ALTER USER analyst_priya SET DEFAULT_WAREHOUSE = 'ANALYST_WH';
-- Issue 4: One warehouse for everything — separate workloads
-- Create dedicated warehouses (as shown in Part 05)
-- dbt_pipeline_wh (Medium), analyst_wh (Small+multi-cluster), dashboard_wh (X-Small)
-- STEP 4: Verify improvement
-- After 30 days with all fixes:
-- Total credits: 1,840 (was 5,390) → 66% reduction
-- Query performance: unchanged or improved (better pruning for dbt model)
-- Analyst experience: better (dedicated warehouse, no contention with pipeline)The cost reduction was entirely from better configuration — no query rewrites, no architecture changes. Separating warehouses, fixing cluster keys, adjusting dashboard refresh rates, and ensuring analysts used the right warehouse size reduced the bill by 66%. Most Snowflake overspend comes from these operational patterns rather than intrinsically expensive queries.
5 Interview Questions — With Complete Answers
Errors You Will Hit — And Exactly Why They Happen
🎯 Key Takeaways
- ✓Columnar storage is faster for analytics because queries only read the columns they need. A query using 3 of 200 columns reads 1.5% of the data vs 100% in row storage. Columnar also compresses better — homogeneous column data enables run-length encoding, dictionary encoding, and delta encoding that mixed-type row blocks cannot match.
- ✓Choose correct data types for maximum compression: DECIMAL for amounts (not VARCHAR), TIMESTAMPTZ for timestamps (enables delta encoding), BOOLEAN for flags (bit packing), properly-sized VARCHAR lengths. Never store numbers as strings — loses all numeric compression and comparison performance.
- ✓Snowflake has three layers: Cloud Services (query planner, metadata, result cache — always on), Virtual Warehouse (compute with local SSD cache — pay per second), Storage (micro-partitions on S3 — cheap). Understanding which layer a bottleneck is in determines the fix.
- ✓Snowflake micro-partition pruning works by comparing filter values to per-partition min/max metadata without reading any data. Pruning requires: filter on a column whose ordering is correlated with partition boundaries, no function applied to the filter column. Never apply functions to filter columns in WHERE clauses.
- ✓Cluster keys (CLUSTER BY) re-organise data so rows with similar values in the cluster key columns are co-located in the same micro-partitions. Use cluster keys for columns frequently used in WHERE filters where natural loading order does not provide pruning. Monitor clustering depth — values above 5 indicate reclustering is needed.
- ✓Snowflake's result cache returns query results instantly if the exact same SQL was run within 24 hours and no underlying data changed. Schedule BI dashboard refreshes after pipeline completion (data-driven), not on a fixed interval that may run before the pipeline and serve stale cached results.
- ✓Use separate virtual warehouses for separate workload types. dbt transformation pipeline, analyst self-service, and BI dashboard service accounts all have different resource needs and should not compete on the same warehouse. Dedicated warehouses enable per-team cost attribution and prevent pipeline runs from blocking analyst queries.
- ✓In Redshift, distribution keys (DISTKEY) determine which node stores each row. Tables frequently joined together should share the same distribution key on the join column — this enables DS_DIST_NONE (no redistribution at join time). DS_DIST_BOTH in EXPLAIN means both tables are being redistributed — the most expensive join type.
- ✓For Redshift performance: always run VACUUM after large loads (reclaims deleted space, re-sorts data) and ANALYZE (updates query planner statistics). Without current statistics, the query planner makes poor join order and join type choices. These must be run manually — Redshift does not auto-update statistics.
- ✓Read query profiles after any slow query. In Snowflake: check partitions scanned vs total (pruning), spill to disk (warehouse too small), and the slowest operator (bottleneck). In PostgreSQL EXPLAIN ANALYZE: check for Seq Scan where Index Scan is expected, large discrepancy between estimated and actual rows (stale statistics), and DS_DIST_BOTH in Redshift (wrong distribution keys).
Discussion
0Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.