GCP Vertex AI — Pipelines and AutoML
Vertex AI Training, Pipelines, Feature Store, Model Registry, and online prediction endpoints. The GCP-native ML platform with best-in-class BigQuery integration.
Vertex AI is what happens when Google builds a managed ML platform on top of the infrastructure that runs Search, Maps, YouTube, and Gmail. BigQuery is the native data warehouse. TPUs are first-class compute. The Feature Store is the most production-ready managed one available.
Vertex AI is Google's unified ML platform — launched in 2021 by merging AI Platform, AutoML, and several other GCP ML services into a single product. It is the platform of choice at Indian companies that run on GCP: Ola, Juspay, ShareChat, Dunzo, and many analytics-heavy companies. Its distinguishing strengths over Azure ML and SageMaker: BigQuery integration is native and seamless (query data directly from training scripts without copying to object storage), the Vertex AI Feature Store is the most complete managed feature store across all three clouds, and TPU access is unique to GCP.
The Vertex AI SDK (google-cloud-aiplatform) is the Python interface. Like SageMaker, GCP services work together: Cloud Storage (equivalent of S3) holds data and artifacts, Artifact Registry (equivalent of ECR) holds Docker images, Cloud Logging holds job logs, and IAM manages permissions via service accounts. The mental model from the last two modules transfers directly — different names, same concepts.
Azure ML is a hotel, SageMaker is a city block, and Vertex AI is a university campus. Everything is Google-designed and integrated — the cafeteria (BigQuery) is connected to the research labs (Vertex Training) by a covered walkway, the library (Feature Store) is shared by all departments, and the campus bus (Vertex Pipelines) runs on a fixed schedule connecting everything. Off-campus services exist but the campus is designed to keep you within the Google ecosystem, and for data-heavy ML work the integration genuinely pays off.
The most important Vertex AI concept: everything is a resource with a resource name in the format projects/{project}/locations/{region}/{resourceType}/{id}. Every API call uses this format. Every log entry references it. Once you internalise this pattern, navigating Vertex AI becomes predictable — you always know where to look for anything.
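The pattern is regular enough to build and parse mechanically. A minimal sketch, with a hypothetical project, region, and endpoint ID:

```python
# Hypothetical helper illustrating the Vertex AI resource-name pattern.
# "my-project", "asia-south1", and the endpoint ID are made-up examples.
def build_resource_name(project: str, region: str, resource_type: str, resource_id: str) -> str:
    """Assemble projects/{project}/locations/{region}/{resourceType}/{id}."""
    return f"projects/{project}/locations/{region}/{resource_type}/{resource_id}"

def parse_resource_name(name: str) -> dict:
    """Split a full resource name back into its four parts."""
    parts = name.split("/")
    if len(parts) != 6 or parts[0] != "projects" or parts[2] != "locations":
        raise ValueError(f"not a Vertex AI resource name: {name!r}")
    return {"project": parts[1], "region": parts[3],
            "resource_type": parts[4], "resource_id": parts[5]}

name = build_resource_name("my-project", "asia-south1", "endpoints", "1234567890")
print(name)                               # projects/my-project/locations/asia-south1/endpoints/1234567890
print(parse_resource_name(name)["region"])  # asia-south1
```

Every SDK object exposes this full name as its `resource_name`, which is what you paste into Cloud Logging filters when debugging.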
GCS, IAM service accounts, Artifact Registry, and Vertex — four services that power every job
CustomTrainingJob — submit any Python script to Vertex managed compute
A Vertex AI CustomTrainingJob is equivalent to an AML Command Job and a SageMaker Training Job. You provide a Python script, a machine type, and optionally a custom Docker image. Vertex provisions a GCE instance, runs the script, streams logs to Cloud Logging, and uploads model artifacts to GCS. The instance terminates immediately when the job completes. Pre-built containers for scikit-learn, XGBoost, PyTorch, and TensorFlow eliminate the need to build custom Docker images for standard frameworks.
Vertex AI Pipelines — KFP components and pipelines with full lineage tracking
Vertex AI Pipelines is built on Kubeflow Pipelines (KFP) v2 — the same open-source pipeline framework used at Airbnb, Twitter, and many Indian companies running on-premise Kubernetes. Each step is a KFP component decorated with @component. Components are pure Python functions that declare typed inputs and outputs. The pipeline function wires components together — outputs of one step become inputs of the next. Vertex compiles the pipeline to a YAML artifact and runs it on managed infrastructure with full lineage tracking in the Vertex ML Metadata store.
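A minimal sketch of that compile step, assuming the `kfp` package: two typed components wired into a pipeline, compiled to the YAML artifact Vertex runs. The pipeline name and GCS root are made up:

```python
# Minimal KFP v2 sketch; names and the GCS artifact root are hypothetical.
PIPELINE_NAME = "demo-train-eval"
PIPELINE_ROOT = "gs://my-project-pipelines"

def build_and_compile(output_path: str = "pipeline.yaml"):
    # Imported inside the function so the sketch parses without kfp installed.
    from kfp import compiler, dsl

    @dsl.component(base_image="python:3.10")
    def make_number() -> int:
        return 42

    @dsl.component(base_image="python:3.10")
    def double(x: int) -> int:
        return x * 2

    @dsl.pipeline(name=PIPELINE_NAME, pipeline_root=PIPELINE_ROOT)
    def pipeline():
        n = make_number()
        double(x=n.output)  # the output of one step becomes the input of the next

    compiler.Compiler().compile(pipeline, output_path)
    # Submit the YAML with aiplatform.PipelineJob(template_path=output_path, ...).run()
```

The typed signatures (`-> int`, `x: int`) are what give Vertex its lineage graph — every artifact flowing between components is tracked in ML Metadata.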
Vertex AI Feature Store — define features once, serve at 1ms online and petabyte offline
Vertex AI Feature Store is widely considered the most production-ready managed feature store across all three major clouds. It solves the training-serving skew problem from Module 69 at scale — features are defined once, computed once, and served consistently to both the training pipeline (point-in-time correct historical values) and the inference endpoint (latest values at <1ms latency). BigQuery serves as the offline store. Bigtable or the Vertex-managed online store serves the online tier.
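"Point-in-time correct" is worth making concrete. The lookup the offline store performs can be sketched locally in pure Python (timestamps and values below are invented): for each label timestamp, take the latest feature value at or before it — never a later one, which would leak the future into training data.

```python
from bisect import bisect_right
from typing import Optional

# Invented feature history: (timestamp, avg_txn_amount) rows, sorted by time.
feature_history = [
    (100, 50.0),
    (200, 75.0),
    (300, 90.0),
]

def feature_as_of(ts: int) -> Optional[float]:
    """Return the latest feature value at or before ts (point-in-time correct)."""
    times = [t for t, _ in feature_history]
    i = bisect_right(times, ts)
    return feature_history[i - 1][1] if i else None

print(feature_as_of(250))  # 75.0 — using the 300-timestamp value would be leakage
print(feature_as_of(50))   # None — no feature value existed yet
```

The Feature Store runs this join at BigQuery scale across millions of entities; the online tier simply serves `feature_as_of(now)` precomputed, which is what keeps training and serving consistent.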
Vertex AI Online Predictions — deploy, call, and split traffic in three SDK calls
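The three calls, sketched with the `google-cloud-aiplatform` SDK. The project, model resource name, and instance payload are hypothetical; in a `traffic_split` passed to `deploy()`, the key `"0"` refers to the model being deployed in that call:

```python
# Deploy, call, and split traffic — all identifiers below are placeholders.
PROJECT = "my-project"
REGION = "asia-south1"
CANARY_SPLIT = {"previous-deployed-model-id": 90, "0": 10}  # "0" = model in this deploy call

def deploy_and_predict():
    from google.cloud import aiplatform  # imported here so the sketch parses offline

    aiplatform.init(project=PROJECT, location=REGION)
    model = aiplatform.Model(
        "projects/my-project/locations/asia-south1/models/1234567890"
    )

    # 1. deploy — canary: 10% of traffic to the new model, 90% stays on the old one
    endpoint = model.deploy(machine_type="n1-standard-2", traffic_split=CANARY_SPLIT)

    # 2. call — instances is a list of feature vectors in the model's input format
    prediction = endpoint.predict(instances=[[5.1, 3.5, 1.4, 0.2]])

    # 3. promote — shift 100% of traffic to the newly deployed model
    new_id = endpoint.list_models()[-1].id
    endpoint.update(traffic_split={new_id: 100})
    return prediction
```

The split values must sum to 100, and the endpoint keeps serving throughout — promotion is a metadata change, not a redeployment.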
Every common Vertex AI mistake — explained and fixed
All three cloud ML platforms are covered. Next: MLOps on cloud — CI/CD for ML across all three platforms.
You have now covered Azure ML, SageMaker, and Vertex AI — the three platforms that run production ML at Indian enterprises and startups. Module 79 ties them together: MLOps on Cloud — how to build CI/CD pipelines for ML that work regardless of which cloud you are on. GitHub Actions triggering retraining, model quality gates in CI, automated deployment to staging and production, and the patterns that make the entire ML lifecycle repeatable from a single git push.
🎯 Key Takeaways
- ✓ Vertex AI is built on the same GCP infrastructure that runs Google Search and Gmail. Its differentiators over Azure ML and SageMaker: native BigQuery integration (query training data directly without copying to object storage), the most production-ready managed Feature Store, and first-class TPU access. Used at Ola, Juspay, ShareChat, and analytics-heavy Indian companies.
- ✓ Every Vertex AI resource follows the naming pattern projects/{project}/locations/{region}/{resourceType}/{id}. All four supporting services work together: Cloud Storage for data and artifacts, IAM service accounts for permissions, Artifact Registry for Docker images, Cloud Logging for all job logs. When a job fails, check Cloud Logging first — the Python traceback is always there.
- ✓ Vertex AI Pipelines uses KFP v2 components and pipelines. Lightweight @component decorators are convenient for simple steps but install packages at runtime — slow for large dependency sets. Container components (custom Docker images) are faster and should be used for any step that runs more than once. Pre-built Google components from google_cloud_pipeline_components handle AutoML, BigQuery export, and model upload.
- ✓ Vertex AI Feature Store is the most complete managed feature store across all three clouds. BigQuery is the offline store (petabyte scale, point-in-time correct serving). Bigtable or the Vertex-managed online store serves features at <1ms latency. batch_serve_to_bq() generates training datasets with point-in-time correct features. materialize() syncs offline to online on a schedule.
- ✓ Online Endpoints support traffic splitting natively via the traffic_split parameter in deploy(). Canary: {'old_id': 90, '0': 10}, where the key '0' stands for the model being deployed in that call. Full promotion: {'new_id': 100}. AutoML Tabular creates a dataset from BigQuery, runs Neural Architecture Search automatically, and returns a deployable model — the fastest path to a strong baseline with a budget_milli_node_hours cost cap.
- ✓ Three common Vertex AI failures: PermissionDenied (grant roles/aiplatform.user + Storage Object Admin + BigQuery Data Viewer to the service account), KFP component failure (check Cloud Logging for the Python traceback — generic SDK error messages hide the real cause), slow components (pre-bake dependencies into a Docker image instead of using packages_to_install — eliminates 3-10 minutes of package install time per component run).