GCP IAM for Data Engineers — Access Control Without the Confusion
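GCP IAM (Identity and Access Management) confuses most newcomers because it looks similar to AWS IAM and Azure RBAC but works differently: GCP attaches roles to resources through bindings, and those bindings are inherited down the hierarchy from organization to folder to project to individual resource. Understanding how to grant the right access to the right resources is essential for building secure data pipelines on GCP.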
The three concepts you need
Member: who is getting access. Current GCP documentation calls this a principal, although the policy format still uses the field name members. A member can be a Google account, service account, Google group, or domain.
Role: what access is being granted. A role is a collection of permissions. Predefined roles (for example roles/bigquery.dataEditor) bundle common permissions. Custom roles let you define exactly the permission set you need.
Binding: the connection between a member and a role on a specific resource. For example: grant service account X the role roles/bigquery.dataEditor on dataset Y.
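To make the member, role, and binding relationship concrete, here is a minimal sketch in plain Python. It models a policy as a dictionary with a bindings list, where each binding holds one role and a list of members, which mirrors the shape of an IAM policy document. The helper name add_binding, the project name, and the service account email are hypothetical. The sketch ignores conditional bindings and makes no API calls.

```python
def add_binding(policy: dict, role: str, member: str) -> dict:
    """Return a copy of `policy` with `member` granted `role`.

    `policy` follows the shape {"bindings": [{"role": ..., "members": [...]}]}.
    The input policy is not modified.
    """
    # Copy the bindings so the caller's policy stays untouched.
    bindings = [
        {"role": b["role"], "members": list(b["members"])}
        for b in policy.get("bindings", [])
    ]
    for binding in bindings:
        if binding["role"] == role:
            # The role already has a binding: add the member once.
            if member not in binding["members"]:
                binding["members"].append(member)
            break
    else:
        # No binding for this role yet: create one.
        bindings.append({"role": role, "members": [member]})
    return {**policy, "bindings": bindings}


# Hypothetical example: service account X gets dataEditor on dataset Y.
dataset_policy = {"bindings": []}
dataset_policy = add_binding(
    dataset_policy,
    "roles/bigquery.dataEditor",
    "serviceAccount:etl-loader@example-project.iam.gserviceaccount.com",
)
```

Members carry a type prefix such as user:, serviceAccount:, group:, or domain:, which is how one policy can mix all four kinds of member.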
Service accounts for pipelines
Service accounts are the closest GCP equivalent of AWS IAM roles or Azure service principals. Your Dataflow jobs, Composer DAGs, and Compute Engine (GCE) instances use service accounts to authenticate to other GCP services.
Best practice: create one service account per workload with only the permissions it needs. A Dataflow job that reads from Pub/Sub and writes to BigQuery needs:
- roles/pubsub.subscriber on the subscription
- roles/bigquery.dataEditor on the target dataset
- roles/bigquery.jobUser on the project
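Nothing more for data access. The worker service account also needs roles/dataflow.worker on the project so that Dataflow can run the job at all. This is the principle of least privilege.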
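One way to keep a service account at least privilege is to compare what it has been granted against what the workload needs. The sketch below does that for the three data-access roles listed above, using plain Python sets. The function name audit_grants, the resource labels, and the over-broad roles/bigquery.admin grant are illustrative assumptions, and the grants are typed in by hand rather than read from GCP.

```python
from typing import Dict, Iterable, List, Set, Tuple

# A grant is (role, kind of resource the role is bound on).
Grant = Tuple[str, str]

# The three data-access roles from the list above.
REQUIRED_DATA_ACCESS: Set[Grant] = {
    ("roles/pubsub.subscriber", "subscription"),
    ("roles/bigquery.dataEditor", "dataset"),
    ("roles/bigquery.jobUser", "project"),
}


def audit_grants(granted: Iterable[Grant], required: Set[Grant]) -> Dict[str, List[Grant]]:
    """Report which required grants are missing and which grants are excess."""
    granted_set = set(granted)
    return {
        "missing": sorted(required - granted_set),
        "excess": sorted(granted_set - required),
    }


# Hypothetical current state: jobUser is absent and an admin role crept in.
current_grants = [
    ("roles/pubsub.subscriber", "subscription"),
    ("roles/bigquery.dataEditor", "dataset"),
    ("roles/bigquery.admin", "project"),
]
report = audit_grants(current_grants, REQUIRED_DATA_ACCESS)
```

Here report["missing"] holds the roles/bigquery.jobUser grant and report["excess"] holds the roles/bigquery.admin grant, which is the one to remove.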
Common data engineering role bindings
BigQuery analyst (read only): roles/bigquery.dataViewer + roles/bigquery.jobUser
Dataflow pipeline runner: roles/dataflow.worker on the worker service account (roles/dataflow.developer is for whoever launches and manages jobs) + roles/bigquery.dataEditor + roles/storage.objectViewer
Composer DAG runner: roles/composer.worker + service-specific roles for each GCP service the DAGs call
GCS pipeline: roles/storage.objectCreator (write) + roles/storage.objectViewer (read)
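Bundles like these are easier to apply consistently when they live in one place. As a rough sketch, the lookup table below encodes the BigQuery analyst and GCS bundles from the list above and expands one of them into bindings for a given member. The bundle keys, the function name bindings_for, and the group address are hypothetical.

```python
from typing import Dict, List

# Role bundles taken from the list above; the keys are made-up labels.
ROLE_BUNDLES: Dict[str, List[str]] = {
    "bigquery_analyst": ["roles/bigquery.dataViewer", "roles/bigquery.jobUser"],
    "gcs_writer": ["roles/storage.objectCreator"],
    "gcs_reader": ["roles/storage.objectViewer"],
}


def bindings_for(bundle: str, member: str) -> List[dict]:
    """Expand a named bundle into one binding per role for `member`."""
    if bundle not in ROLE_BUNDLES:
        raise KeyError(f"unknown role bundle: {bundle}")
    return [{"role": role, "members": [member]} for role in ROLE_BUNDLES[bundle]]


# Hypothetical group of read-only analysts.
analyst_bindings = bindings_for("bigquery_analyst", "group:analysts@example.com")
```

Granting bundles to a group rather than to individual accounts keeps the number of bindings small as people join and leave the team.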
Workload Identity — the modern approach
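Workload Identity lets GKE workloads, including Cloud Composer environments, which run on GKE, use service accounts without downloading key files. The workload authenticates using its Kubernetes service account, which is mapped to a GCP service account. Dataflow workers run on Compute Engine VMs rather than GKE; they pick up credentials from the worker service account attached to the job, which also requires no key file.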
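This eliminates one of the biggest security risks in GCP pipelines: service account JSON key files getting committed to Git or exposed in container images. Enable Workload Identity (now named Workload Identity Federation for GKE) on every new GKE cluster you use for data engineering, and avoid creating service account keys wherever an attached service account will do.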
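The mapping between a Kubernetes service account and a GCP service account comes down to two pieces: an IAM binding on the GCP service account and an annotation on the Kubernetes service account. The sketch below only builds those two values as Python data so their format is visible; it does not call GCP or Kubernetes. The project, namespace, and account names are hypothetical, and the helper function names are made up for illustration.

```python
def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """Member string that identifies a Kubernetes service account to IAM."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def workload_identity_binding(project_id: str, namespace: str, ksa_name: str) -> dict:
    """Binding to add to the GCP service account's own IAM policy.

    It lets the Kubernetes service account impersonate the GCP service account.
    """
    return {
        "role": "roles/iam.workloadIdentityUser",
        "members": [workload_identity_member(project_id, namespace, ksa_name)],
    }


def ksa_annotation(gsa_email: str) -> dict:
    """Annotation to set on the Kubernetes service account."""
    return {"iam.gke.io/gcp-service-account": gsa_email}


# Hypothetical names for a pipeline running in the "etl" namespace.
binding = workload_identity_binding("example-project", "etl", "loader-ksa")
annotation = ksa_annotation("etl-loader@example-project.iam.gserviceaccount.com")
```

With both pieces in place, pods that run as the Kubernetes service account receive short-lived credentials for the GCP service account, so no JSON key ever has to exist.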