Intermediate · +150 XP

Google BigQuery

BigQuery is GCP's flagship analytics service: a serverless, massively parallel data warehouse that can query terabytes in seconds, with zero infrastructure to manage and zero tuning required. It is arguably the most capable SQL analytics engine in any cloud.

15 min read · March 2026

What makes BigQuery different from every other warehouse

Most data warehouses, including Redshift, Synapse, and classic Snowflake deployments, require you to provision compute capacity before you can run queries: you choose a cluster or warehouse size, pay for it while it runs whether you use it or not, and retune it as your workload changes. BigQuery's on-demand model has none of that. You write SQL, Google runs it across thousands of machines automatically, and you pay only for the bytes your query scans. When you are done querying, there is nothing running and nothing to pay.

This serverless model, combined with Google's Dremel query engine, which splits queries across thousands of parallel workers in milliseconds, is why BigQuery can scan terabytes in seconds. It is a combination few other cloud data warehouses match.
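Because billing is per byte scanned, the back-of-the-envelope cost of any query is simple arithmetic. A minimal sketch, assuming the common on-demand rate of $6.25 per TiB (check current pricing for your region, as rates vary):

```python
# Sketch: estimate on-demand query cost from bytes scanned.
# The $6.25/TiB rate is an assumption; confirm against current GCP pricing.

TIB = 1024 ** 4  # one tebibyte in bytes

def estimate_query_cost(bytes_scanned: int, price_per_tib: float = 6.25) -> float:
    """Return the estimated on-demand cost in USD for a single query."""
    return bytes_scanned / TIB * price_per_tib

# A query scanning 500 GiB of a large table:
cost = estimate_query_cost(500 * 1024 ** 3)
```

The BigQuery client also supports dry-run queries that report `total_bytes_processed` before you spend anything, which pairs naturally with an estimate like this.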

💸 Serverless

Zero infrastructure. Zero cluster sizing. Zero idle cost. Pay only per query based on bytes scanned.

🔀 Massively Parallel

Google's Dremel engine splits queries across thousands of workers simultaneously. Terabytes in seconds.

Columnar Storage

Data stored by column, not row. Queries that read 3 columns from a 100-column table scan only those 3 columns.
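The columnar effect on cost is easy to model. A sketch with hypothetical column sizes (the names and 10 MB-per-column figure are invented for illustration):

```python
# Sketch: why columnar storage cuts bytes scanned.
# Hypothetical wide table: 100 columns of 10 MB each.
column_bytes = {f"col_{i}": 10_000_000 for i in range(100)}

def bytes_scanned(selected_columns) -> int:
    """In a columnar store, only the selected columns' bytes are read."""
    return sum(column_bytes[c] for c in selected_columns)

full_row_scan = sum(column_bytes.values())                    # row store: every byte
three_col_scan = bytes_scanned(["col_0", "col_1", "col_2"])   # columnar: 3% of the data
```

This is also why `SELECT *` is an antipattern in BigQuery: it forces a scan of every column, and you are billed for all of it.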

Partitioning and clustering — the most important optimization

BigQuery charges per byte scanned. Partitioning divides a table into segments by a column (usually date) so queries only scan relevant partitions. Clustering sorts data within partitions so BigQuery can skip blocks that do not match filter conditions. Together they can reduce query costs by 90% or more on large tables.
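The pruning math works the same way for partitions. A sketch with hypothetical numbers (365 daily partitions of 1 GB each): a seven-day date filter scans 7 partitions instead of all 365.

```python
from datetime import date, timedelta

# Sketch: partition pruning. Daily partitions of 1 GB each are an
# invented example; real partition sizes depend on your data.
partition_bytes = {
    date(2026, 1, 1) + timedelta(days=d): 1_000_000_000 for d in range(365)
}

def bytes_scanned(start: date, end: date) -> int:
    """A date filter scans only the partitions inside the range."""
    return sum(b for day, b in partition_bytes.items() if start <= day <= end)

full_scan = sum(partition_bytes.values())                          # unpartitioned table
last_week = bytes_scanned(date(2026, 12, 25), date(2026, 12, 31))  # 7 partitions
```

The same filter against an unpartitioned table would be billed for the full scan, because BigQuery cannot know which rows match without reading them.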

create_gold_table.sql
sql
-- BigQuery: create a partitioned, clustered table
-- Partitioning + clustering makes queries dramatically cheaper and faster

CREATE TABLE IF NOT EXISTS `your_project.gold.daily_sales_summary`
(
  order_date      DATE,
  region          STRING,
  product_category STRING,
  total_revenue   FLOAT64,
  order_count     INT64,
  avg_order_value FLOAT64,
  unique_customers INT64
)
PARTITION BY order_date          -- Only scan partitions matching the date filter
CLUSTER BY region, product_category  -- Sort within partitions so blocks not matching these filters can be skipped
OPTIONS (
  partition_expiration_days = 365,  -- Auto-delete partitions older than 1 year
  description = 'Gold layer: aggregated daily sales from Dataflow pipeline'
);

Writing analytics queries in BigQuery

BigQuery uses standard SQL with Google extensions. All the window functions, CTEs, and aggregations from the SQL for Data Engineers module work exactly the same way.

analytics_queries.sql
sql
-- BigQuery: real analytics queries
-- Window functions, CTEs, and aggregations all work exactly as in standard SQL

-- 1. Monthly revenue with month-over-month growth
WITH monthly AS (
  SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    region,
    SUM(total_revenue) AS monthly_revenue
  FROM `your_project.gold.daily_sales_summary`
  WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH)
  GROUP BY 1, 2
)
SELECT
  month,
  region,
  monthly_revenue,
  LAG(monthly_revenue) OVER (PARTITION BY region ORDER BY month) AS prev_month,
  ROUND(
    SAFE_DIVIDE(
      monthly_revenue - LAG(monthly_revenue) OVER (PARTITION BY region ORDER BY month),
      LAG(monthly_revenue) OVER (PARTITION BY region ORDER BY month)
    ) * 100, 2
  ) AS mom_growth_pct
FROM monthly
ORDER BY month DESC, monthly_revenue DESC;
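The `SAFE_DIVIDE` in this query returns NULL instead of erroring when the divisor is zero or NULL, which is exactly what happens for each region's first month (no previous month to compare against). A sketch of the same semantics in Python, with `None` standing in for NULL:

```python
# Sketch of BigQuery SAFE_DIVIDE semantics: NULL (None) instead of an
# error when the divisor is 0 or NULL.

def safe_divide(x, y):
    if x is None or y is None or y == 0:
        return None
    return x / y

def mom_growth_pct(current, previous):
    """Month-over-month growth as in the query; None for the first month."""
    delta = None if previous is None else current - previous
    ratio = safe_divide(delta, previous)
    return None if ratio is None else round(ratio * 100, 2)
```

Plain `/` in BigQuery raises a division-by-zero error, so `SAFE_DIVIDE` is the idiomatic choice whenever the denominator can legitimately be zero or NULL.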

Loading data from GCS using Python

bigquery_load.py
python
# BigQuery Python client — load data from Cloud Storage
from google.cloud import bigquery

client = bigquery.Client(project='your-project-id')

# Load Parquet files from GCS into BigQuery (Gold layer)
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(field='order_date'),
    clustering_fields=['region', 'product_category'],
)

load_job = client.load_table_from_uri(
    'gs://your-bucket/gold/daily_sales_summary/*.parquet',
    'your_project.gold.daily_sales_summary',
    job_config=job_config
)

load_job.result()  # Wait for job to complete
print(f"Loaded {client.get_table('your_project.gold.daily_sales_summary').num_rows} rows")
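The `write_disposition` in the load job matters: `WRITE_APPEND` adds the new files' rows to the table, while `WRITE_TRUNCATE` replaces its contents. A sketch of those semantics on plain lists, to make the difference concrete:

```python
# Sketch: BigQuery write disposition semantics, modeled on Python lists.
# The row dicts are invented examples.

def load(table_rows, new_rows, disposition):
    """Return the table contents after a load with the given disposition."""
    if disposition == "WRITE_TRUNCATE":
        return list(new_rows)           # replace everything
    if disposition == "WRITE_APPEND":
        return table_rows + list(new_rows)  # keep existing rows, add new ones
    raise ValueError(f"unsupported disposition: {disposition}")

existing = [{"order_date": "2026-03-01"}]
incoming = [{"order_date": "2026-03-02"}]

appended = load(existing, incoming, "WRITE_APPEND")     # 2 rows
truncated = load(existing, incoming, "WRITE_TRUNCATE")  # 1 row
```

For a daily pipeline writing into date partitions, `WRITE_APPEND` is the usual choice; rerunning a day with `WRITE_APPEND` duplicates rows, so idempotent backfills typically truncate the specific partition instead.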
🎯 Pro Tip
Always partition by date and cluster by your most common filter/group columns. On a 1TB table, a well-partitioned and clustered query can cost 100x less than the same query on an unpartitioned table. This is the single most impactful thing you can do for BigQuery cost management.

🎯 Key Takeaways

  • BigQuery is serverless — zero infrastructure, zero cluster sizing, pay only for bytes scanned per query
  • Google's Dremel engine splits queries across thousands of parallel workers — terabytes scan in seconds
  • Always partition tables by date — queries with date filters only scan matching partitions, dramatically cutting costs
  • Clustering sorts data within partitions — add your most common filter/GROUP BY columns as cluster keys
  • BigQuery uses standard SQL — window functions, CTEs, and all SQL skills from the Foundations section apply directly
  • The Python client library loads data from GCS into BigQuery — the standard pattern for Dataflow pipeline output
📄 Resume Bullet Points
Copy these directly to your resume — tailored from this lesson

Designed Google BigQuery tables with date partitioning and clustering — reducing query costs by 85% on billion-row datasets

Built BigQuery Python pipelines loading Parquet files from GCS with time partitioning and clustering configured via LoadJobConfig

Wrote BigQuery SQL analytics using CTEs, window functions, and SAFE_DIVIDE for month-over-month business reporting
