AWS SageMaker — Training Jobs and Pipelines
SageMaker training jobs, SageMaker Pipelines, Feature Store, Clarify for bias detection, and JumpStart model hub. Production ML on AWS from scratch.
Everything from Modules 69–74 — pipelines, experiment tracking, model registry, deployment, monitoring — exists as a managed service on AWS. SageMaker is the platform so you do not have to build and maintain that infrastructure yourself.
The MLOps section built every component from scratch: Prefect for pipelines, MLflow for experiment tracking, FastAPI + Docker + Kubernetes for deployment, Evidently for monitoring, DVC for data versioning. Amazon SageMaker bundles equivalent versions of all of these into a single managed service. You still write the same Python training scripts — the platform handles compute provisioning, job scheduling, artifact storage, endpoint scaling, and monitoring dashboards.
Think of SageMaker as a managed MLOps platform — it is like having AWS automatically run your MLflow tracking server, Kubernetes cluster, model registry, and feature store so you only pay for what you use and never manage the servers. Your training code stays the same; SageMaker is just the environment it runs in.
SageMaker's key advantage over running MLOps yourself is managed compute. You do not pre-provision servers. You request a training job and SageMaker spins up the exact instance type you need, runs your script, saves artifacts to S3, then shuts the instance down. You pay per second of compute used. At scale, this is dramatically cheaper than keeping Kubernetes nodes warm 24/7.
SageMaker SDK — connect your local environment to AWS
The SageMaker Python SDK wraps the AWS APIs into high-level objects: Session, Estimator, Pipeline. You install it alongside boto3 (the low-level AWS SDK) and authenticate via an IAM role.
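A minimal setup might look like the following sketch; the role ARN is a placeholder you would replace with your own:

```python
# pip install sagemaker boto3
import sagemaker

# Uses your default AWS credentials and region.
session = sagemaker.Session()

# Inside a SageMaker notebook you can call sagemaker.get_execution_role();
# from a local machine, pass the ARN of a role SageMaker may assume (placeholder below).
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Default artifact bucket, named sagemaker-<region>-<account-id>.
bucket = session.default_bucket()
```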
Every SageMaker job runs as an IAM role, not as your personal AWS user. The role needs AmazonSageMakerFullAccess plus S3 read/write on your training bucket. In production, create a least-privilege role that only grants access to the specific S3 paths and ECR repositories your jobs need.
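A least-privilege policy for such a role might look like this sketch; the bucket name and paths are placeholders (a real role also needs CloudWatch Logs permissions for job logging):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TrainingDataAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-ml-bucket",
        "arn:aws:s3:::my-ml-bucket/training/*"
      ]
    },
    {
      "Sid": "PullTrainingImages",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    }
  ]
}
```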
Upload training data to S3
SageMaker training jobs read data from S3. The SDK provides a helper to upload a local directory and return the S3 URI.
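A sketch of that helper in use; the local path and key prefix are placeholders:

```python
import sagemaker

session = sagemaker.Session()

# Uploads the local directory to s3://<default-bucket>/<key_prefix>/
# and returns the S3 URI to pass as a training input channel.
train_s3_uri = session.upload_data(
    path="data/train",                  # local file or directory (placeholder)
    key_prefix="delivery-demand/train", # key prefix inside the bucket
)
print(train_s3_uri)
```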
Training Jobs — run any script on managed compute
A SageMaker Training Job is the equivalent of an AML Command Job: you point to a script, specify an instance type, and SageMaker provisions the machine, runs the script, saves /opt/ml/model/ to S3, then terminates the instance. Your training script is unchanged from local development.
The training script (unchanged from local)
SageMaker mounts each input channel as a directory and sets SM_CHANNEL_<NAME>. Pass inputs={'train': s3_uri} to estimator.fit() and your script sees SM_CHANNEL_TRAIN pointing to that data — no boto3 download code needed.
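As a sketch, a script in this style could read the channel and model directories from those env vars; the CSV column name and the mean-baseline "model" below are illustrative stand-ins for a real framework fit:

```python
import csv
import json
import os

def train(train_dir: str, model_dir: str) -> dict:
    """Fit a trivial mean baseline and save it where SageMaker expects artifacts."""
    minutes = []
    with open(os.path.join(train_dir, "train.csv")) as f:
        for row in csv.DictReader(f):
            minutes.append(float(row["delivery_minutes"]))
    model = {"mean_delivery_minutes": sum(minutes) / len(minutes)}
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump(model, f)
    return model

if __name__ == "__main__":
    # SageMaker mounts the 'train' channel here and uploads SM_MODEL_DIR to S3.
    train_dir = os.environ.get("SM_CHANNEL_TRAIN", "data/train")
    model_dir = os.environ.get("SM_MODEL_DIR", "model")
    if os.path.isdir(train_dir):
        print(train(train_dir, model_dir))
```

Because the paths come from env vars with local fallbacks, the same file runs unmodified on your laptop and inside the SageMaker container.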
Submit with the Estimator API
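A submission sketch using the built-in scikit-learn framework estimator; the entry point, role ARN, S3 URI, and container version are assumptions:

```python
from sagemaker.sklearn import SKLearn

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder
train_s3_uri = "s3://my-ml-bucket/delivery-demand/train"        # placeholder

estimator = SKLearn(
    entry_point="train.py",        # the unchanged local training script
    role=role,
    instance_type="ml.m5.xlarge",  # provisioned only for this job's duration
    instance_count=1,
    framework_version="1.2-1",     # prebuilt scikit-learn container version
)

# Blocks until the job finishes; logs stream to your terminal. Inside the
# container the 'train' channel appears at the path in SM_CHANNEL_TRAIN,
# and whatever the script writes to SM_MODEL_DIR is uploaded to S3.
estimator.fit(inputs={"train": train_s3_uri})
```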
Monitor from the CLI
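A few useful commands for watching a job from a terminal; the job name is a placeholder:

```shell
# Recent jobs, newest first
aws sagemaker list-training-jobs --sort-by CreationTime --sort-order Descending --max-results 5

# Status and instance details for one job
aws sagemaker describe-training-job --training-job-name my-training-job \
  --query '{Status: TrainingJobStatus, Instance: ResourceConfig.InstanceType}'

# Stream the job's CloudWatch logs (training jobs log to this group)
aws logs tail /aws/sagemaker/TrainingJobs --follow
```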
SageMaker Pipelines — chain prepare → train → evaluate as a reusable DAG
A single Estimator runs one script. A Pipeline chains multiple steps together — the output artifact of one step flows automatically into the next. This is SageMaker's equivalent of the Prefect flow you built in Module 69 and the AML Pipeline from Module 76. SageMaker Pipelines add managed data passing, step-level caching (skip unchanged steps), a visual DAG in Studio, and a cron/event trigger for automated retraining.
Add cache_config=CacheConfig(enable_caching=True, expire_after='P30D') to any step (the expiry is an ISO 8601 duration). If the step inputs (data URI + code hash) match a previous run within the expiry window, SageMaker reuses the output artifact instead of re-running. A daily retraining pipeline that detects no new data will skip the expensive training step automatically.
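A single-step pipeline sketch with caching enabled, assuming the estimator and role from the training-job section above; the pipeline name and S3 URI are placeholders:

```python
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import CacheConfig, TrainingStep

# Reuse cached output if inputs are unchanged within 30 days (ISO 8601 duration).
cache = CacheConfig(enable_caching=True, expire_after="P30D")

train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,  # the Estimator configured in the previous section
    inputs={"train": TrainingInput(s3_data="s3://my-ml-bucket/delivery-demand/train")},
    cache_config=cache,
)

pipeline = Pipeline(name="delivery-retraining", steps=[train_step])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # run it; steps with cache hits are skipped
```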
SageMaker Feature Store — write features once, use them everywhere
Without a feature store, every ML team writes its own pipeline to compute the same features (average delivery time per postcode, customer order frequency, etc.). Those pipelines diverge — training uses one version, the inference service uses a slightly different one, and the resulting skew quietly degrades model accuracy. SageMaker Feature Store is a centralised repository: compute a feature once, write it to the store, and all models read the identical values at both training time and inference time.
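A sketch of creating a feature group and ingesting rows; the group name, columns, bucket path, and role ARN are assumptions:

```python
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# One row per customer; Feature Store requires an event-time column.
df = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "avg_delivery_minutes": [32.5, 41.0],
    "event_time": [1717000000.0, 1717000000.0],  # Unix timestamps
})

fg = FeatureGroup(name="customer-delivery-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)  # infer feature types from the DataFrame
fg.create(
    s3_uri="s3://my-ml-bucket/feature-store",  # offline store (training reads)
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,                  # low-latency store (inference reads)
)
fg.ingest(data_frame=df, max_workers=2, wait=True)
```

Training jobs query the offline store via Athena; the inference service calls the online store's GetRecord API, so both read the same ingested values.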
SageMaker Clarify — detect bias and explain predictions
Clarify runs as a processing job that produces a bias report and feature importance report. You run it once after training to check for pre-training bias in the dataset (are certain postcodes underrepresented?) and post-training bias in model predictions (does the model systematically over-predict delivery time for certain areas?). You can also attach Clarify to a real-time endpoint to get per-prediction SHAP explanations.
Is one group underrepresented in training data?
Do groups get different rates of positive predictions?
Does model accuracy differ across groups?
Which features most influenced each prediction?
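A pre-training bias check might be configured like this sketch; the label, facet column, S3 paths, and role ARN are assumptions:

```python
import sagemaker
from sagemaker import clarify

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-ml-bucket/delivery-demand/train",
    s3_output_path="s3://my-ml-bucket/clarify-reports",
    label="late_delivery",     # hypothetical binary label column
    dataset_type="text/csv",
)

# The facet is the sensitive attribute to audit; the column name is illustrative.
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],  # which label value counts as 'positive'
    facet_name="postcode_group",
)

# Dataset-only analysis: representation and label imbalance per facet.
processor.run_pre_training_bias(data_config=data_config, data_bias_config=bias_config)
```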
SageMaker JumpStart — deploy foundation models in minutes
JumpStart is AWS's model hub — pre-trained foundation models (Llama, Mistral, Stable Diffusion, etc.) and classic ML models that you can deploy to a SageMaker endpoint with a few lines of code. No container building, no custom inference code — JumpStart handles the serving infrastructure. Use it when you want to fine-tune a foundation model or build a quick baseline before training from scratch.
JumpStart is like PyPI for production ML models. Instead of pip install transformers and writing a Flask server, you write JumpStartModel(model_id='...').deploy() and AWS handles the container, GPU driver, scaling, and HTTPS endpoint.
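A deployment sketch; the model ID is an illustrative catalog entry and the request payload format varies by model:

```python
from sagemaker.jumpstart.model import JumpStartModel

# Model IDs come from the JumpStart catalog; this one is illustrative.
model = JumpStartModel(model_id="huggingface-llm-mistral-7b")

# Provisions the serving container, instance, and HTTPS endpoint.
# Gated models require accepting the provider's EULA.
predictor = model.deploy(accept_eula=True)

# Text-generation models typically take an 'inputs' field.
print(predictor.predict({"inputs": "Summarise: the order arrived 20 minutes late."}))

predictor.delete_endpoint()  # endpoints bill while running; delete when done
```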
Common SageMaker errors and how to fix them
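The failure reason on the job record usually names the problem directly, for example an access-denied error (the execution role lacks S3 permissions on the data path) or ResourceLimitExceeded (your account quota for that instance type is zero; request an increase). A quick diagnosis, with a placeholder job name:

```shell
# Why did the job fail?
aws sagemaker describe-training-job \
  --training-job-name my-failed-job \
  --query 'FailureReason' --output text

# Stack trace from inside the training container
aws logs tail /aws/sagemaker/TrainingJobs --since 1h
```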
🎯 Key Takeaways
- ✓ SageMaker Training Jobs provision compute on demand — you pay per second and the instance terminates automatically when done.
- ✓ Your training script is unchanged — SageMaker injects data paths via SM_CHANNEL_* env vars and uploads SM_MODEL_DIR to S3.
- ✓ SageMaker Pipelines chain steps as a DAG with automatic data passing, step caching, and cron/event triggers.
- ✓ Feature Store solves train/serve skew — write features once, read identical values from the offline store (training) and online store (inference).
- ✓ Clarify runs pre- and post-training bias analysis and SHAP explanations as a processing job — no model code changes required.
- ✓ JumpStart deploys foundation models in minutes; use JumpStartEstimator to fine-tune them on your own data.