Security and Compliance for Data Engineers
GDPR and the India DPDP Act — what they mean for your pipelines and how to build systems that are compliant by design.
Most data engineering tutorials teach you how to build pipelines that work. Almost none teach you how to build pipelines that are legal. That gap will cost you at some point — either in production when your company faces a GDPR audit, or in an interview when a hiring manager at Razorpay or PhonePe asks how you handle PII in your Kafka topics.
This module covers what you actually need to know as a data engineer: encryption, PII handling, access control, GDPR, and India's new Digital Personal Data Protection Act. Not legal theory — practical decisions your pipelines must make.
1. What You Are Actually Protecting Against
Security is not abstract. As a data engineer you have three concrete threats to think about: outsiders intercepting data on the network or reading it off compromised storage, insiders using more access than their job requires, and accidental exposure through leaked credentials or unmasked production data in dev. The encryption, access control, and secrets management sections of this module map directly onto those three.
2. Encryption — At Rest and In Transit
Encryption is the first line of defence. There are two distinct problems: data being intercepted while it moves (in transit), and data being read from disk if storage is compromised (at rest). They require different solutions.
Encryption in transit
Any time data moves across a network — from your pipeline to a database, from Kafka producer to broker, from your API client to S3 — it must be encrypted using TLS (Transport Layer Security). Without TLS, anyone on the network path can read your data in plain text.
| Component | How to enforce TLS |
|---|---|
| PostgreSQL / any DB | Set sslmode=require in the connection string. Never use disable. |
| Kafka | Configure listeners with SSL protocol. Set security.protocol=SSL on producers and consumers. |
| HTTP APIs | Always use https://. Reject http:// connections at load balancer level. |
| Azure Blob / ADLS | Enforce HTTPS-only traffic on storage account. Enabled by default — do not disable it. |
| S3 (AWS) | Bucket policy with aws:SecureTransport = false → Deny. This blocks HTTP access. |
| Cloud SQL / RDS | Enable require_ssl in DB flags. Provide CA certificate to application. |
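Enforcing the table above from pipeline code is mostly connection configuration. A minimal sketch in Python (the hostnames, credentials, and CA path are placeholders; the Kafka dict assumes a confluent-kafka-style client):

```python
# Build connection settings that refuse unencrypted transport.
# Endpoints and credentials below are placeholders, not real systems.
import os

def pg_dsn(host: str, db: str, user: str, password: str) -> str:
    """PostgreSQL DSN with TLS required; use sslmode=verify-full to also pin the CA."""
    return f"postgresql://{user}:{password}@{host}:5432/{db}?sslmode=require"

def kafka_ssl_config(bootstrap_servers: str) -> dict:
    """Producer/consumer settings that force the broker's SSL listener."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": "SSL",
        # CA certificate used to verify the broker; path is illustrative
        "ssl.ca.location": os.environ.get("KAFKA_CA_CERT", "/etc/ssl/kafka-ca.pem"),
    }

dsn = pg_dsn("db.internal", "analytics", "etl_user", "not-a-real-password")
assert "sslmode=require" in dsn
```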
Encryption at rest
Encryption at rest means data stored on disk is encrypted. If someone steals the physical disk or gets unauthorized access to raw storage, they see ciphertext, not your customer records.
On all major cloud platforms, encryption at rest is enabled by default for object storage (S3, Azure Blob, GCS) and managed databases. Your job is to make sure you are using the right key type and haven't accidentally disabled it.
| Key type | What it means | When to use it |
|---|---|---|
| SSE-S3 / platform-managed | Cloud provider manages the keys. Easy, free, zero ops. | Default for most data. Use unless compliance requires customer-managed keys. |
| Customer-Managed Keys (CMK) | You create and control keys in KMS / Azure Key Vault. You can rotate and revoke. | PII, financial data, healthcare. Required by PCI-DSS and many enterprise customers. |
| Client-side encryption | You encrypt before sending to the cloud. Cloud never sees plaintext. | Highest sensitivity. Significant operational overhead. Rare in practice. |
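When a dataset needs a customer-managed key, the key must be named explicitly at write time; otherwise S3 falls back to the default key. A sketch of the SSE-KMS parameters for boto3's `put_object` (the bucket name and key ARN in the usage comment are placeholders):

```python
# Build put_object arguments that force SSE-KMS with a specific CMK,
# rather than the account's default SSE-S3 key.
def sse_kms_args(bucket: str, key: str, kms_key_arn: str) -> dict:
    """kwargs for boto3 put_object enforcing a customer-managed key."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ServerSideEncryption": "aws:kms",  # SSE-KMS, not default SSE-S3
        "SSEKMSKeyId": kms_key_arn,         # the CMK you control and can revoke
    }

# Usage (needs AWS credentials and boto3):
# import boto3
# s3 = boto3.client("s3", region_name="ap-south-1")
# s3.put_object(Body=b"...", **sse_kms_args("pii-bucket", "raw/users.parquet", key_arn))
```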
Column-level encryption for sensitive fields
Full-disk encryption protects you if storage is stolen. It does not protect you from a legitimate database user running SELECT email, phone FROM users. For fields like Aadhaar numbers, phone numbers, and payment card data, you need column-level encryption — the field is stored as ciphertext in the database, and only systems with the decryption key can read the real value.
# Column-level encryption with Python (Fernet symmetric encryption)
# Use this pattern when storing sensitive fields in your data warehouse
from cryptography.fernet import Fernet
import os

# Key should come from your secrets manager (AWS Secrets Manager, Azure Key Vault)
# NEVER hardcode keys in source code
ENCRYPTION_KEY = os.environ['COLUMN_ENCRYPTION_KEY']
fernet = Fernet(ENCRYPTION_KEY.encode())

def encrypt_field(value: str | None) -> str | None:
    """Encrypt a sensitive field before writing to the database."""
    if value is None:
        return None
    return fernet.encrypt(value.encode()).decode()

def decrypt_field(encrypted_value: str | None) -> str | None:
    """Decrypt a field when it needs to be read."""
    if encrypted_value is None:
        return None
    return fernet.decrypt(encrypted_value.encode()).decode()

# In your pipeline:
row = {
    'user_id': 'U1234',
    'name': 'Priya Sharma',                       # Not sensitive — store as is
    'email': encrypt_field('priya@example.com'),  # Sensitive — encrypt
    'phone': encrypt_field('+91 9876543210'),     # Sensitive — encrypt
    'aadhaar_last4': encrypt_field('5678'),       # Sensitive — encrypt
    'city': 'Hyderabad',                          # Not sensitive — store as is
}

# Key rotation: generate new key, decrypt with old, re-encrypt with new
# This is an operational concern — document your key rotation schedule

3. PII — Identifying and Handling Personal Data
PII stands for Personally Identifiable Information — any data that can directly identify a person or, in combination with other data, identify a person. As a data engineer, your job is to know what PII your pipelines touch, where it goes, and how it is protected at every step.
What counts as PII
| Type | Examples | Risk level |
|---|---|---|
| Direct identifiers | Full name, Aadhaar number, PAN, passport, phone, email | High — identifies person directly |
| Quasi-identifiers | Pincode + birthdate + gender (can re-identify when combined) | Medium — risky in combination |
| Sensitive personal data (DPDP / GDPR) | Health data, financial data, biometrics, caste, religion, sexual orientation | Very high — stricter rules apply |
| Derived data | Credit score, location history, behaviour profile built from raw data | High — still personal data even if derived |
| Pseudonymous data | user_id replacing email (mapping table exists separately) | Medium — still PII if re-identification is possible |
| Anonymous data | Aggregated stats with no re-identification path | Not PII — regulations do not apply |
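The pseudonymous-data row deserves one caution: a plain hash of an email can be reversed by hashing a list of known emails. A keyed hash (HMAC) avoids that, because re-identification requires the secret key. A sketch (the environment variable name is illustrative):

```python
# Pseudonymise a direct identifier with a keyed hash (HMAC-SHA256).
# Without the key, an attacker cannot rebuild the mapping by hashing
# a dictionary of known emails. Key/env-var names are illustrative.
import hashlib
import hmac
import os

PSEUDONYM_KEY = os.environ.get("PSEUDONYM_KEY", "dev-only-key").encode()

def pseudonymise(identifier: str) -> str:
    """Deterministic pseudonym: same input + same key -> same output."""
    digest = hmac.new(PSEUDONYM_KEY, identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

# Same email always maps to the same pseudonym, so joins still work:
assert pseudonymise("priya@example.com") == pseudonymise("priya@example.com")
assert pseudonymise("priya@example.com") != pseudonymise("ravi@example.com")
```

Because the output is deterministic, you can still count distinct users or join tables on the pseudonym without ever moving the raw email downstream.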
The four things you must do with PII in pipelines
Only collect and store PII that you actually need. If your analytics pipeline only needs city-level location data, don't ingest lat/lon coordinates. If you need to count active users, use a hashed user_id, not the email address.
Every table and column containing PII should be tagged in your data catalogue. This is how you answer "where is our PII stored?" in 30 seconds instead of 30 days when an audit arrives.
-- Example: tagging in dbt schema.yml
models:
  - name: orders
    columns:
      - name: customer_email
        meta:
          pii: true
          pii_type: direct_identifier
          gdpr_relevant: true
          dpdp_relevant: true
      - name: customer_phone
        meta:
          pii: true
          pii_type: direct_identifier

Production data must never be used in development or testing environments without masking. Developers don't need real email addresses to debug a pipeline — they need data in the right format with the right shape.
# Data masking for dev/test environments
import hashlib
import re

def mask_email(email: str) -> str:
    """Replace real email with consistent but fake email."""
    if not email:
        return email
    hashed = hashlib.sha256(email.encode()).hexdigest()[:8]
    return f"user_{hashed}@masked.dev"

def mask_phone(phone: str) -> str:
    """Keep format, replace digits with X except last 4."""
    digits = re.sub(r'\D', '', phone)
    return 'XXXXXX' + digits[-4:] if len(digits) >= 4 else 'XXXXXXXXXX'

def mask_aadhaar(aadhaar: str) -> str:
    """Standard Aadhaar masking — show only last 4."""
    digits = re.sub(r'\D', '', aadhaar)
    return 'XXXX XXXX ' + digits[-4:] if len(digits) >= 4 else 'XXXX XXXX XXXX'

# Apply during the staging → dev copy process, not in production pipelines

Analysts should not have raw access to the PII columns in your production tables. Use column masking policies (Databricks, Snowflake, BigQuery support this natively) so analysts see the masked value by default, and only privileged roles see the real value.
-- Snowflake: column masking policy
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING)
RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('DATA_ENGINEER', 'PRIVACY_ADMIN') THEN val
ELSE CONCAT(LEFT(val, 2), '****@****.com')
END;
-- Apply to the column
ALTER TABLE customers
MODIFY COLUMN email
SET MASKING POLICY email_mask;
-- Analyst sees: pr****@****.com
-- Engineer sees: priya@freshmart.in

4. Access Control — RBAC and Least Privilege
Access control is the answer to the insider threat. The principle is simple: every user and every system gets the minimum permissions they need to do their job — nothing more. This is called least privilege.
Role-Based Access Control (RBAC)
Instead of granting permissions to individual users, you define roles (Data Engineer, Analyst, Pipeline Service Account, Admin) and assign permissions to roles. Users are assigned to roles. When someone changes jobs, you change their role — not 47 individual permissions.
| Role | Typical permissions |
|---|---|
| Data Engineer | Read/write to raw, silver, gold layers. Create and modify pipelines. No access to prod secrets. |
| Analyst | Read-only on gold/reporting layer. Masked PII columns. No access to raw or silver. |
| Pipeline Service Account | Read source systems. Write to specific target tables only. No login access to database. |
| Privacy Admin | Read unmasked PII. Execute deletion jobs. Access to audit logs. |
| Admin | Full access. Requires approval workflow. Every action logged. |
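One way to keep a role matrix like this enforceable is to generate the grants from a single definition, so the table above and the database never drift apart. A hedged sketch (role, schema, and privilege names are illustrative, not a standard):

```python
# Generate least-privilege GRANT statements from one role definition.
# Roles and schemas mirror the table above and are illustrative.
ROLE_GRANTS = {
    "data_engineer": {"raw": "ALL", "silver": "ALL", "gold": "ALL"},
    "analyst": {"gold": "SELECT"},          # read-only, reporting layer only
    "pipeline_svc": {"gold": "INSERT"},     # write target tables, nothing else
}

def grant_statements(role: str) -> list[str]:
    """Emit one GRANT per schema this role may touch, and nothing else."""
    return [
        f"GRANT {privs} ON ALL TABLES IN SCHEMA {schema} TO {role};"
        for schema, privs in ROLE_GRANTS[role].items()
    ]

for stmt in grant_statements("analyst"):
    print(stmt)  # GRANT SELECT ON ALL TABLES IN SCHEMA gold TO analyst;
```

Reviewing one dict in a pull request is far easier than auditing scattered GRANTs, and revoking access becomes a one-line change.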
Attribute-Based Access Control (ABAC)
RBAC works well when roles are stable. ABAC is more fine-grained — access is granted based on attributes of the user, the data, and the context. For example: "an analyst can read customer data only if the customer's region matches the analyst's assigned region." BigQuery, Databricks Unity Catalog, and Apache Ranger all support ABAC-style row-level and column-level security.
-- Row-level security in PostgreSQL
-- Each analyst can only see rows for their assigned region
CREATE POLICY region_isolation ON customers
USING (region = current_setting('app.user_region'));
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;
-- In your application / pipeline connection:
-- SET app.user_region = 'south_india';
-- Now queries on customers only return south_india rows

5. GDPR — What Data Engineers Need to Know
GDPR (General Data Protection Regulation) is a European Union law that came into force in 2018. It applies to any company that processes personal data of EU residents — including Indian companies that have EU customers. Fines go up to 4% of global annual revenue. Meta was fined €1.2 billion in 2023.
You don't need to read all 99 articles. As a data engineer, five GDPR requirements directly affect how you build pipelines: a lawful basis (usually consent) for processing, purpose limitation, the right to erasure, the right of access (you must be able to export everything you hold on a user), and breach notification to the regulator within 72 hours.
Crypto-erasure — the practical way to handle deletion in data lakes
Deleting a record from a data warehouse is easy. Deleting it from an immutable data lake (S3/ADLS with versioning) is hard. The practical solution is crypto-erasure: encrypt the user's PII with a user-specific key stored in a key management service. To "delete" the user, delete their encryption key. All their encrypted data becomes permanently unreadable without modifying any files.
# Crypto-erasure pattern
# Each user's PII is encrypted with a unique per-user key
# Deletion = deleting the key from KMS
import boto3

kms = boto3.client('kms', region_name='ap-south-1')

def get_or_create_user_key(user_id: str) -> str:
    """Return the KMS key ID for this user, creating it if needed."""
    # In practice, store the key ARN in a mapping table
    response = kms.create_key(
        Description=f'PII encryption key for user {user_id}',
        Tags=[{'TagKey': 'user_id', 'TagValue': user_id}]
    )
    return response['KeyMetadata']['KeyId']

def erase_user(user_id: str, key_id: str):
    """
    GDPR right to erasure via crypto-erasure.
    Schedules key deletion — AWS KMS minimum waiting period is 7 days.
    After deletion, all PII encrypted with this key is permanently unreadable.
    """
    kms.schedule_key_deletion(
        KeyId=key_id,
        PendingWindowInDays=7  # Minimum allowed by AWS KMS
    )
    print(f"Key for user {user_id} scheduled for deletion. PII will be unreadable in 7 days.")
    # Log this action to your audit trail (see the audit logging section)
    log_audit_event('ERASURE_REQUESTED', actor='privacy_admin', resource='kms',
                    record_id=user_id, metadata={'key_id': key_id})

6. India Digital Personal Data Protection Act (DPDP) 2023
India's Digital Personal Data Protection Act was passed in August 2023. It is the first comprehensive personal data protection law in India, replacing a patchwork of older IT Act provisions. The rules (secondary legislation) were expected in 2024–2025 and are being finalized as of March 2026. The core obligations, however, are already clear.
Key concepts in DPDP for data engineers
| DPDP Term | Plain meaning | Your pipeline implication |
|---|---|---|
| Data Principal | The individual whose data is being processed (your user) | You must be able to identify all data for a given user_id across your systems |
| Data Fiduciary | The company that decides what data to collect and how to use it (your employer) | Your company must appoint a Data Protection Officer for significant fiduciaries |
| Consent | Must be free, specific, informed, and unambiguous. No pre-checked boxes. | Tag data with consent purpose. Don't use data beyond consented purpose. |
| Purpose limitation | Data used only for the specific purpose for which consent was given | Same as GDPR — documented business purpose required per field |
| Data erasure | User can request deletion. Company must delete when purpose is fulfilled. | Same deletion capability as GDPR. Deletion when retention period expires, not just on request. |
| Data localisation | Certain "significant" data fiduciaries may be required to store data in India | Watch for storage region requirements — may affect your cloud region choice |
| Children's data | Parental consent required for users under 18. No behavioural tracking of children. | If your platform has minors, age verification and restricted processing required |
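The consent and purpose-limitation rows translate into a concrete pipeline check: carry the consented purposes with each record and filter before processing. A minimal sketch (the `consented_purposes` field is an illustrative schema choice, not something the Act prescribes):

```python
# Enforce purpose limitation at read time.
# Each record carries the purposes the Data Principal consented to;
# a pipeline declares its purpose and only sees matching rows.

def filter_by_purpose(records: list[dict], purpose: str) -> list[dict]:
    """Drop rows whose consent does not cover this pipeline's purpose."""
    return [r for r in records if purpose in r.get("consented_purposes", [])]

rows = [
    {"user_id": "U1", "consented_purposes": ["order_fulfilment", "analytics"]},
    {"user_id": "U2", "consented_purposes": ["order_fulfilment"]},
]
# The analytics pipeline only ever sees U1's data:
assert [r["user_id"] for r in filter_by_purpose(rows, "analytics")] == ["U1"]
```

In a warehouse the same idea becomes a join against a consent table; the point is that the purpose check happens in the pipeline, not in a policy document.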
GDPR vs DPDP — similarities and differences
| Area | GDPR | India DPDP |
|---|---|---|
| Scope | EU residents' data | Indian residents' digital data |
| Legal basis | Consent, legitimate interest, contract, legal obligation, vital interest, public task | Consent and "legitimate uses" (state functions, employment, emergencies, research) |
| Right to erasure | Yes — without undue delay, within one month | Yes — timeline per rules (expected similar) |
| Right to access | Yes — detailed Subject Access Request | Yes — right to access information about data processed |
| Data breach notification | 72 hours to regulator | Without delay to Data Protection Board (timeline per rules) |
| Fines | Up to €20M or 4% global revenue | Up to ₹250 crore per instance (rules may revise) |
| DPO requirement | Required for certain organisations | Required for "Significant Data Fiduciaries" (defined by rules) |
| Cross-border transfer | Adequacy decisions or standard clauses | Allowed except to countries notified as restricted |
7. Compliance by Design — A Practical Checklist
Compliance bolted on after a pipeline is live is expensive and incomplete. Compliance built into the pipeline from the start is cheap and reliable. Here is the checklist to run when designing any pipeline that touches personal data:
- Inventory and tag every PII column in the data catalogue before the first production run.
- Collect only the fields the documented business purpose requires (data minimisation).
- Enforce TLS on every connection; use customer-managed keys at rest for sensitive data.
- Mask PII in every non-production environment.
- Apply RBAC with least privilege, with column masking for analyst roles.
- Build deletion (hard delete or crypto-erasure) and access-export paths on day one.
- Record every PII read, export, and deletion in an immutable audit log.
- Pull all credentials from a secrets manager or a managed identity.
8. Audit Logging
Audit logs answer "who did what, to which data, and when." They are your proof of compliance, your first tool in a breach investigation, and your defence in a regulatory audit. They are also one of the most commonly skipped pieces of data infrastructure.
# Minimal audit log event — write this to an immutable audit log table
# or a WORM (Write Once Read Many) log bucket
from datetime import datetime, timezone
import json

def log_audit_event(
    action: str,                    # READ_PII, DELETE_RECORD, EXPORT_DATA, SCHEMA_CHANGE
    actor: str,                     # user_id or service_account_name of who did it
    resource: str,                  # table name, pipeline name, file path
    record_id: str | None = None,   # user_id or record_id affected (if applicable)
    metadata: dict | None = None,   # any additional context
):
    event = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'action': action,
        'actor': actor,
        'resource': resource,
        'record_id': record_id,
        'metadata': metadata or {},
    }
    # Write to immutable audit log — append only, no updates, no deletes
    # Options: Cloud Storage with object lock, dedicated audit table, CloudTrail, Azure Monitor
    print(json.dumps(event))  # Replace with your log sink

# Examples
log_audit_event('READ_PII', 'analyst_ravi', 'customers', metadata={'purpose': 'support_ticket_123'})
log_audit_event('DELETE_RECORD', 'privacy_admin', 'customers', record_id='U98765', metadata={'reason': 'GDPR_erasure_request'})
log_audit_event('EXPORT_DATA', 'data_engineer', 'orders', record_id='U12345', metadata={'reason': 'DPDP_access_request'})

9. Secrets Management — Never Hardcode Credentials
This is one of the most common mistakes made by junior data engineers: database passwords, API keys, and storage account keys hardcoded in Python scripts or committed to Git. A secret in your Git history is a secret that has leaked — even if you delete the commit later, it may already sit in a fork, a clone, or an attacker's automated scan of public repositories.
| What not to do | What to do instead |
|---|---|
| password = "Freshm@rt123!" | password = os.environ["DB_PASSWORD"] |
| connection_string = "Server=...;Password=abc;" | Fetch from Azure Key Vault / AWS Secrets Manager at runtime |
| API key hardcoded in Airflow DAG file | Airflow Variables or Connections (encrypted in Airflow metadata DB) |
| Storage account key in ADF linked service definition | Use Managed Identity — no key at all. ADF authenticates via Azure AD. |
| Credentials in Docker environment file committed to Git | .env in .gitignore — inject via CI/CD secrets or Kubernetes secrets |
# Fetching secrets from AWS Secrets Manager at runtime
import boto3
import json

def get_secret(secret_name: str, region: str = 'ap-south-1') -> dict:
    """Fetch a secret by name. Returns dict of key-value pairs."""
    client = boto3.client('secretsmanager', region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])

# In your pipeline
db_secret = get_secret('freshmart/prod/postgres')
connection_string = (
    f"postgresql://{db_secret['username']}:{db_secret['password']}"
    f"@{db_secret['host']}:{db_secret['port']}/{db_secret['dbname']}"
)

# Azure equivalent: use DefaultAzureCredential + Key Vault
# from azure.keyvault.secrets import SecretClient
# from azure.identity import DefaultAzureCredential
# client = SecretClient(vault_url="https://kv-freshmart.vault.azure.net/", credential=DefaultAzureCredential())
# secret = client.get_secret("postgres-password").value

10. What This Looks Like at Work
Your first task might be: "We have a new DPDP compliance requirement — audit the raw layer and flag every column that contains personal data." You open the data catalogue, run a query across column names and sample values, and produce a spreadsheet with every PII field, its table, and its current protection status. That's a real day-one task, and it requires knowing what PII is.
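That column scan can start as a simple heuristic over the catalogue (feed it from `information_schema.columns` or your catalogue export). The pattern list below is a starting point for the audit, not an exhaustive definition of PII, so always verify matches against sample values:

```python
# Heuristic PII scan: flag columns whose names suggest personal data.
# Feed columns from: SELECT table_name, column_name FROM information_schema.columns;
import re

PII_PATTERNS = [
    r"email", r"phone", r"mobile", r"aadhaar", r"pan", r"passport",
    r"name", r"dob", r"birth", r"address", r"pincode",
]

def flag_pii_columns(columns: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """columns = [(table, column)]; return the ones matching a PII pattern."""
    rx = re.compile("|".join(PII_PATTERNS), re.IGNORECASE)
    return [(t, c) for t, c in columns if rx.search(c)]

catalogue = [("orders", "order_id"), ("orders", "customer_email"),
             ("users", "phone_number"), ("users", "signup_ts")]
assert flag_pii_columns(catalogue) == [("orders", "customer_email"),
                                       ("users", "phone_number")]
```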
Health data is "sensitive personal data" under DPDP. Your pipeline that ingests patient records into the data warehouse must use customer-managed encryption keys, full audit logging on every SELECT, and row-level security so only the assigned doctor's team can see their patients' records. A senior engineer will review your pipeline and specifically check these controls before approving.
"How would you handle a GDPR deletion request in your current architecture?" is a common senior DE interview question. The right answer covers: locating all records for the user_id across all layers, the deletion mechanism (hard delete vs crypto-erasure for the lake), updating aggregates if the user contributed to pre-computed tables, and logging the deletion with a timestamp and actor.
Errors You'll Hit
🎯 Key Takeaways
- ✓ Encryption in transit (TLS) and at rest are separate problems — you need both.
- ✓ PII is any data that can identify a person, directly or in combination. Tag it, minimise it, mask it in dev, and control access to it in prod.
- ✓ GDPR and India DPDP both require: consent, purpose limitation, right to erasure, right to access, and breach notification. Build deletion and access-export capability into every pipeline that touches personal data.
- ✓ Crypto-erasure is the practical solution for GDPR deletion in immutable data lakes — encrypt per user with a unique key, then delete the key.
- ✓ Audit logs must be immutable, append-only, and stored separately from the systems being audited.
- ✓ Never hardcode credentials. Use environment variables in dev, secrets managers in production, and managed identities where the cloud supports them.
- ✓ Compliance is cheapest when built in at design time. Retrofitting a non-compliant pipeline is 10× harder.