
Security and Compliance for Data Engineers

GDPR and the India DPDP Act — what they mean for your pipelines and how to build systems that are compliant by design.

50 min March 2026
Last verified March 2026 — GDPR (2018), India DPDP Act (2023)

Most data engineering tutorials teach you how to build pipelines that work. Almost none teach you how to build pipelines that are legal. That gap will cost you at some point — either in production when your company faces a GDPR audit, or in an interview when a hiring manager at Razorpay or PhonePe asks how you handle PII in your Kafka topics.

This module covers what you actually need to know as a data engineer: encryption, PII handling, access control, GDPR, and India's new Digital Personal Data Protection Act. Not legal theory — practical decisions your pipelines must make.

🎯 Pro Tip
You don't need a law degree. You need to understand the rules well enough to ask the right questions and build systems that don't create problems for your company. Legal advice comes from lawyers. Pipeline design comes from you.

1. What You Are Actually Protecting Against

Security is not abstract. As a data engineer you have three concrete threats to think about:

👤
Insider access
A junior analyst can SELECT * from the customer table and export 2 million email addresses to a CSV. Many incidents start with legitimate internal access, through misuse or plain error, rather than with an outside attacker.
🌐
External breach
An attacker who gets into your Kafka broker or S3 bucket reads every event your system has ever produced. Data at rest must be encrypted.
📋
Regulatory audit
A regulator asks you to prove that user X's data was deleted within one month of their deletion request. Can you delete it everywhere? Can you prove that you did?

2. Encryption — At Rest and In Transit

Encryption is the first line of defence. There are two distinct problems: data being intercepted while it moves (in transit), and data being read from disk if storage is compromised (at rest). They require different solutions.

Encryption in transit

Any time data moves across a network — from your pipeline to a database, from Kafka producer to broker, from your API client to S3 — it must be encrypted using TLS (Transport Layer Security). Without TLS, anyone on the network path can read your data in plain text.

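| Component | How to enforce TLS |
| --- | --- |
| PostgreSQL / any DB | Set sslmode=require in the connection string (or verify-full to also validate the server certificate). Never use disable. |
| Kafka | Configure listeners with the SSL protocol. Set security.protocol=SSL (or SASL_SSL when using SASL authentication) on producers and consumers. |
| HTTP APIs | Always use https://. Reject http:// connections at the load balancer. |
| Azure Blob / ADLS | Enforce HTTPS-only traffic ("Secure transfer required") on the storage account. It is enabled by default; do not disable it. |
| S3 (AWS) | Add a bucket policy statement that denies requests where aws:SecureTransport is false. This blocks plain-HTTP access. |
| Cloud SQL / RDS | Require SSL/TLS at the instance level (for example rds.force_ssl=1 on RDS for PostgreSQL, or the SSL-only connection setting on Cloud SQL). Provide the CA certificate to the application. |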
⚠️ Important
TLS only protects data while it is moving. Once data lands in your database or object storage, transit encryption does nothing. You need separate encryption at rest for that.
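One way to keep plaintext connections from creeping into a pipeline is to check connection strings before they are used. The sketch below is a minimal illustration, assuming PostgreSQL-style URLs; the function names and the example hosts are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

# sslmode values that guarantee an encrypted PostgreSQL connection.
# 'disable', 'allow' and 'prefer' can all end up on a plaintext connection.
ENCRYPTED_SSLMODES = {"require", "verify-ca", "verify-full"}

def enforces_tls(dsn: str) -> bool:
    """Return True if a postgresql:// URL pins an encrypted sslmode."""
    query = parse_qs(urlparse(dsn).query)
    sslmode = query.get("sslmode", ["prefer"])[0]  # libpq's default is 'prefer'
    return sslmode in ENCRYPTED_SSLMODES

def check_connections(dsns: dict) -> list:
    """Return the names of configured connections that may use plaintext."""
    return [name for name, dsn in dsns.items() if not enforces_tls(dsn)]

if __name__ == "__main__":
    configured = {
        "orders_db": "postgresql://app@db.internal:5432/orders?sslmode=verify-full",
        "legacy_db": "postgresql://app@old.internal:5432/legacy",  # no sslmode set
    }
    print(check_connections(configured))
```

A check like this fits naturally in a CI step or at pipeline start-up, so a missing sslmode fails fast instead of silently falling back.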

Encryption at rest

Encryption at rest means data stored on disk is encrypted. If someone steals the physical disk or gets unauthorized access to raw storage, they see ciphertext, not your customer records.

On all major cloud platforms, encryption at rest is enabled by default for object storage (S3, Azure Blob, GCS) and for most managed databases; on Amazon RDS it is a setting you choose when the instance is created. Your job is to confirm it is on and that you are using the right key type.

| Key type | What it means | When to use it |
| --- | --- | --- |
| Provider-managed keys (e.g. SSE-S3) | The cloud provider manages the keys. Easy, free, zero ops. | Default for most data. Use unless compliance requires customer-managed keys. |
| Customer-Managed Keys (CMK) | You create and control keys in KMS / Azure Key Vault. You can rotate and revoke. | PII, financial data, healthcare. Commonly expected for PCI-DSS scope and by many enterprise customers. |
| Client-side encryption | You encrypt before sending to the cloud. The cloud never sees plaintext. | Highest sensitivity. Significant operational overhead. Rare in practice. |
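To make the key-type choice explicit in code, a pipeline can map a sensitivity label to the encryption arguments it passes on upload. This is a sketch, assuming S3 and boto3's `put_object` parameters `ServerSideEncryption` and `SSEKMSKeyId`; the labels "standard" and "sensitive" and the function name are hypothetical.

```python
from typing import Optional

def s3_encryption_args(sensitivity: str, kms_key_id: Optional[str] = None) -> dict:
    """Map a data sensitivity label to S3 server-side encryption arguments."""
    if sensitivity == "standard":
        # Provider-managed keys (SSE-S3)
        return {"ServerSideEncryption": "AES256"}
    if sensitivity == "sensitive":
        # Customer-managed key in KMS: refuse to write without one
        if not kms_key_id:
            raise ValueError("sensitive data requires a customer-managed KMS key")
        return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": kms_key_id}
    raise ValueError(f"unknown sensitivity label: {sensitivity!r}")

# Usage with boto3 (bucket, key and KMS alias below are placeholders):
#   s3.put_object(Bucket="my-bucket", Key="pii/users.parquet", Body=data,
#                 **s3_encryption_args("sensitive", "alias/pii-key"))
```

Failing loudly when a sensitive write has no key is the design choice here: a silent fallback to provider-managed keys is exactly the accidental downgrade you want to avoid.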

Column-level encryption for sensitive fields

Full-disk encryption protects you if storage is stolen. It does not protect you from a legitimate database user running SELECT email, phone FROM users. For fields like Aadhaar numbers, phone numbers, and payment card data, you need column-level encryption — the field is stored as ciphertext in the database, and only systems with the decryption key can read the real value.

# Column-level encryption with Python (Fernet symmetric encryption)
# Use this pattern when storing sensitive fields in your data warehouse

from cryptography.fernet import Fernet
import os

# Key should come from your secrets manager (AWS Secrets Manager, Azure Key Vault)
# NEVER hardcode keys in source code
ENCRYPTION_KEY = os.environ['COLUMN_ENCRYPTION_KEY']
fernet = Fernet(ENCRYPTION_KEY.encode())

def encrypt_field(value: str) -> str:
    """Encrypt a sensitive field before writing to the database."""
    if value is None:
        return None
    return fernet.encrypt(value.encode()).decode()

def decrypt_field(encrypted_value: str) -> str:
    """Decrypt a field when it needs to be read."""
    if encrypted_value is None:
        return None
    return fernet.decrypt(encrypted_value.encode()).decode()

# In your pipeline:
row = {
    'user_id': 'U1234',
    'name': 'Priya Sharma',          # Direct identifier: left in clear here only if the use case needs it
    'email': encrypt_field('priya@example.com'),   # Sensitive: encrypt
    'phone': encrypt_field('+91 9876543210'),       # Sensitive: encrypt
    'aadhaar_last4': encrypt_field('5678'),         # Sensitive: encrypt
    'city': 'Hyderabad',             # Low sensitivity on its own: store as is
}

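# Key rotation: generate a new key, decrypt with the old one, re-encrypt with the new one.
# The cryptography library's MultiFernet class supports this (see MultiFernet.rotate).
# Document your key rotation schedule; it is an operational concern.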
💡 Note
In a lakehouse (Delta Lake, Iceberg), column-level protection is often enforced by the governance layer (Databricks Unity Catalog, Apache Ranger) through masking and fine-grained access policies rather than by application code. Understand the tool your company uses. The goal is the same: only authorised readers see the real value.

3. PII — Identifying and Handling Personal Data

PII stands for Personally Identifiable Information — any data that can directly identify a person or, in combination with other data, identify a person. As a data engineer, your job is to know what PII your pipelines touch, where it goes, and how it is protected at every step.

What counts as PII

| Type | Examples | Risk level |
| --- | --- | --- |
| Direct identifiers | Full name, Aadhaar number, PAN, passport, phone, email | High: identifies a person directly |
| Quasi-identifiers | Pincode + birthdate + gender (can re-identify when combined) | Medium: risky in combination |
| Special-category / sensitive data | Health data, biometrics, caste, religion, sexual orientation; financial data is usually handled the same way | Very high: GDPR Art. 9 applies stricter rules. The DPDP Act defines no separate category, but treat these as your top tier |
| Derived data | Credit score, location history, behaviour profile built from raw data | High: still personal data even if derived |
| Pseudonymous data | user_id replacing email (mapping table exists separately) | Medium: still PII if re-identification is possible |
| Anonymous data | Aggregated stats with no re-identification path | Not PII: regulations do not apply |
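The quasi-identifier risk can be checked mechanically: count how many rows share each combination and flag the combinations that are too rare. The sketch below is a simple k-anonymity style check; the column names and the threshold `k` are illustrative assumptions, not a regulatory standard.

```python
from collections import Counter

def risky_groups(rows, quasi_identifiers, k=5):
    """Return quasi-identifier combinations shared by fewer than k rows."""
    counts = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}

rows = [
    {"pincode": "500081", "birth_year": 1990, "gender": "F"},
    {"pincode": "500081", "birth_year": 1990, "gender": "F"},
    {"pincode": "500081", "birth_year": 1990, "gender": "F"},
    {"pincode": "560001", "birth_year": 1985, "gender": "M"},
]

print(risky_groups(rows, ["pincode", "birth_year", "gender"], k=3))
# {('560001', 1985, 'M'): 1}
```

A combination that appears only once points at a single person, so a dataset released with that row is not anonymous even though it contains no name or email.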

The four things you must do with PII in pipelines

1 — Minimise

Only collect and store PII that you actually need. If your analytics pipeline only needs city-level location data, don't ingest lat/lon coordinates. If you need to count active users, use a hashed user_id, not the email address.
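As a sketch of what minimisation can look like in code: the event below is reduced to the two fields a "users per city" report needs, and the email is replaced by a keyed hash. The field names and the `minimise_event` helper are hypothetical; the secret key is assumed to come from a secrets manager.

```python
import hashlib
import hmac

def pseudonymise(value: str, secret_key: bytes) -> str:
    """Keyed hash (HMAC-SHA256): stable per input, not reversible without the key."""
    normalised = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()

def minimise_event(event: dict, secret_key: bytes) -> dict:
    """Keep only what a user-count-by-city report needs."""
    return {
        "user_key": pseudonymise(event["email"], secret_key),
        "city": event["city"],
        # lat/lon, phone and the raw email are deliberately dropped
    }

if __name__ == "__main__":
    raw = {"email": "Priya@Example.com", "phone": "+91 9876543210",
           "lat": 17.44, "lon": 78.38, "city": "Hyderabad"}
    print(minimise_event(raw, secret_key=b"load-this-from-a-secrets-manager"))
```

A keyed hash is used instead of a plain SHA-256 because email addresses are guessable: without a secret key, anyone can hash a list of known emails and match them. The output is still pseudonymous data, so it remains personal data for as long as the key exists.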

2 — Classify and tag

Every table and column containing PII should be tagged in your data catalogue. This is how you answer "where is our PII stored?" in 30 seconds instead of 30 days when an audit arrives.

# Example: tagging in dbt schema.yml
models:
  - name: orders
    columns:
      - name: customer_email
        meta:
          pii: true
          pii_type: direct_identifier
          gdpr_relevant: true
          dpdp_relevant: true
      - name: customer_phone
        meta:
          pii: true
          pii_type: direct_identifier
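
Once columns are tagged, answering "where is our PII stored?" becomes a lookup. The sketch below walks a schema that has already been parsed into a Python dict (for example by a YAML loader) and lists every tagged column; the function name and the "unclassified" fallback are assumptions for illustration.

```python
def pii_columns(schema: dict) -> list:
    """Return (model, column, pii_type) for every column tagged pii: true."""
    found = []
    for model in schema.get("models", []):
        for column in model.get("columns", []):
            meta = column.get("meta", {})
            if meta.get("pii"):
                found.append((model["name"], column["name"],
                              meta.get("pii_type", "unclassified")))
    return found

# The same structure as the schema.yml example, as a parsed dict
schema = {
    "models": [
        {"name": "orders", "columns": [
            {"name": "customer_email", "meta": {"pii": True, "pii_type": "direct_identifier"}},
            {"name": "customer_phone", "meta": {"pii": True, "pii_type": "direct_identifier"}},
            {"name": "order_total"},
        ]},
    ]
}

for model, column, pii_type in pii_columns(schema):
    print(f"{model}.{column}: {pii_type}")
```
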
3 — Mask or pseudonymise in non-production

Production data must never be used in development or testing environments without masking. Developers don't need real email addresses to debug a pipeline — they need data in the right format with the right shape.

# Data masking for dev/test environments
import hashlib
import re

def mask_email(email: str) -> str:
    """Replace real email with consistent but fake email."""
    if not email:
        return email
    hashed = hashlib.sha256(email.encode()).hexdigest()[:8]
    return f"user_{hashed}@masked.dev"

def mask_phone(phone: str) -> str:
    """Mask every digit except the last 4 (original formatting is dropped)."""
    digits = re.sub(r'\D', '', phone)  # \D matches any non-digit character
    return 'XXXXXX' + digits[-4:] if len(digits) >= 4 else 'XXXXXXXXXX'

def mask_aadhaar(aadhaar: str) -> str:
    """Standard Aadhaar masking: show only the last 4 digits."""
    digits = re.sub(r'\D', '', aadhaar)  # strip spaces and other separators
    return 'XXXX XXXX ' + digits[-4:] if len(digits) >= 4 else 'XXXX XXXX XXXX'

# Apply during the staging → dev copy process, not in production pipelines
4 — Control access

Analysts should not have raw access to the PII columns in your production tables. Use column masking policies (Databricks, Snowflake, BigQuery support this natively) so analysts see the masked value by default, and only privileged roles see the real value.

-- Snowflake: column masking policy
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_ENGINEER', 'PRIVACY_ADMIN') THEN val
    ELSE CONCAT(LEFT(val, 2), '****@****.com')
  END;

-- Apply to the column
ALTER TABLE customers
  MODIFY COLUMN email
  SET MASKING POLICY email_mask;

-- Analyst sees: pr****@****.com
-- Engineer sees: priya@freshmart.in

4. Access Control — RBAC and Least Privilege

Access control is the answer to the insider threat. The principle is simple: every user and every system gets the minimum permissions they need to do their job — nothing more. This is called least privilege.

Role-Based Access Control (RBAC)

Instead of granting permissions to individual users, you define roles (Data Engineer, Analyst, Pipeline Service Account, Admin) and assign permissions to roles. Users are assigned to roles. When someone changes jobs, you change their role — not 47 individual permissions.

| Role | Typical permissions |
| --- | --- |
| Data Engineer | Read/write to raw, silver, gold layers. Create and modify pipelines. No access to prod secrets. |
| Analyst | Read-only on gold/reporting layer. Masked PII columns. No access to raw or silver. |
| Pipeline Service Account | Read source systems. Write to specific target tables only. No login access to database. |
| Privacy Admin | Read unmasked PII. Execute deletion jobs. Access to audit logs. |
| Admin | Full access. Requires approval workflow. Every action logged. |
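The role model can be sketched in a few lines. In a real platform the warehouse or catalogue enforces this through grants; the snippet below only illustrates the deny-by-default logic, with hypothetical role names, users and (layer, action) pairs loosely based on the roles above.

```python
# Hypothetical role definitions: each role maps to a set of (layer, action) pairs.
ROLE_PERMISSIONS = {
    "data_engineer": {(layer, action)
                      for layer in ("raw", "silver", "gold")
                      for action in ("read", "write")},
    "analyst": {("gold", "read")},
    "privacy_admin": {("gold", "read"), ("gold", "read_unmasked_pii"), ("audit_log", "read")},
}

# Users are assigned roles, never individual permissions.
USER_ROLES = {"ravi": {"analyst"}, "meera": {"data_engineer"}}

def is_allowed(user: str, layer: str, action: str) -> bool:
    """Least privilege: deny unless one of the user's roles grants the permission."""
    return any(
        (layer, action) in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )

if __name__ == "__main__":
    print(is_allowed("ravi", "gold", "read"))   # analyst reading the reporting layer
    print(is_allowed("ravi", "raw", "read"))    # analyst reaching into raw data
```

Moving a person between teams then means editing one entry in `USER_ROLES`, which is the point of RBAC.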

Attribute-Based Access Control (ABAC)

RBAC works well when roles are stable. ABAC is more fine-grained — access is granted based on attributes of the user, the data, and the context. For example: "an analyst can read customer data only if the customer's region matches the analyst's assigned region." BigQuery, Databricks Unity Catalog, and Apache Ranger all support ABAC-style row-level and column-level security.

-- Row-level security in PostgreSQL
-- Each analyst can only see rows for their assigned region

CREATE POLICY region_isolation ON customers
  USING (region = current_setting('app.user_region'));

ALTER TABLE customers ENABLE ROW LEVEL SECURITY;

-- In your application / pipeline connection:
-- SET app.user_region = 'south_india';
-- Now queries on customers only return south_india rows
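
The same attribute-based idea, expressed in application code for cases where the engine cannot enforce it. This is a sketch with hypothetical attribute names (`region`, `pii_cleared`); engine-level policies like the one above are preferable where available, because they cannot be bypassed by a client that forgets the filter.

```python
def can_read_row(user: dict, row: dict) -> bool:
    """Attribute check: the row's region must match the user's assigned region."""
    return row["region"] == user["region"]

def visible_rows(user: dict, rows: list) -> list:
    """Filter rows by region, and mask the email unless the user is PII-cleared."""
    result = []
    for row in rows:
        if not can_read_row(user, row):
            continue
        if not user.get("pii_cleared", False):
            row = {**row, "email": "***"}  # copy, so the source row is not mutated
        result.append(row)
    return result

customers = [
    {"id": 1, "region": "south_india", "email": "priya@example.com"},
    {"id": 2, "region": "north_india", "email": "arjun@example.com"},
]
analyst = {"name": "ravi", "region": "south_india", "pii_cleared": False}
print(visible_rows(analyst, customers))
```
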

5. GDPR — What Data Engineers Need to Know

GDPR (General Data Protection Regulation) is a European Union law that has applied since May 2018. It covers any company that processes personal data of people in the EU, including Indian companies that have EU customers. Fines go up to €20 million or 4% of global annual turnover, whichever is higher. Meta was fined €1.2 billion in 2023.

You don't need to read all 99 articles. As a data engineer, these are the 5 GDPR requirements that directly affect how you build pipelines.

Right to Erasure (Art. 17)
A user can request deletion of all their personal data.
Your pipeline responsibility: Your pipeline must be able to find and delete (or crypto-erase) all records for a given user across every table and every layer (raw, silver, gold), with a documented process for backups. GDPR expects this without undue delay, normally within one month of the request.
Right to Access (Art. 15)
A user can request a copy of all data you hold about them.
Your pipeline responsibility: You must be able to extract all records for user_id = X from your data warehouse and deliver them in a readable format. Your data catalogue must tell you every table that contains user data.
Data Minimisation (Art. 5)
Only collect data that is necessary for the stated purpose.
Your pipeline responsibility: Before ingesting a new field, confirm it has a documented business purpose. Remove unused columns from your pipelines. Don't land "everything" in the raw layer and decide later.
Purpose Limitation (Art. 5)
Data collected for one purpose cannot be used for a different purpose without new consent.
Your pipeline responsibility: If customers gave consent for order notifications, you cannot use their data to train an ML model without separate consent. Tag data with the consent purpose in your catalogue.
Data Breach Notification (Art. 33)
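If a personal data breach occurs, the regulator must be notified within 72 hours of the company becoming aware of it, unless the breach is unlikely to put individuals at risk.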
Your pipeline responsibility: Maintain audit logs. Know exactly what data was accessed, when, and by whom. Without logs, you cannot scope a breach.
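
The access and erasure requirements both reduce to the same capability: given a user ID, visit every table that holds their data. The sketch below shows that shape against an in-memory SQLite database; the table names, key columns and the `USER_DATA_TABLES` catalogue are hypothetical stand-ins for what your data catalogue would provide.

```python
import sqlite3

# Hypothetical catalogue: every table holding personal data, and its user key column.
# Table and column names come from this trusted constant, never from user input.
USER_DATA_TABLES = {"customers": "user_id", "orders": "customer_id"}

def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory warehouse to exercise the functions below."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (user_id TEXT, email TEXT);
        CREATE TABLE orders (order_id INTEGER, customer_id TEXT, amount REAL);
        INSERT INTO customers VALUES ('U1', 'a@example.com'), ('U2', 'b@example.com');
        INSERT INTO orders VALUES (1, 'U1', 250.0), (2, 'U1', 99.0), (3, 'U2', 10.0);
        """
    )
    return conn

def export_user(conn: sqlite3.Connection, user_id: str) -> dict:
    """Right to access: collect every row for one user, grouped by table."""
    export = {}
    for table, key_col in USER_DATA_TABLES.items():
        cursor = conn.execute(f"SELECT * FROM {table} WHERE {key_col} = ?", (user_id,))
        columns = [d[0] for d in cursor.description]
        export[table] = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return export

def erase_user_rows(conn: sqlite3.Connection, user_id: str) -> dict:
    """Right to erasure: delete the user's rows everywhere, in one transaction."""
    deleted = {}
    with conn:  # commits on success, rolls back if any delete fails
        for table, key_col in USER_DATA_TABLES.items():
            cursor = conn.execute(f"DELETE FROM {table} WHERE {key_col} = ?", (user_id,))
            deleted[table] = cursor.rowcount
    return deleted

if __name__ == "__main__":
    conn = demo_connection()
    print(export_user(conn, "U1"))
    print(erase_user_rows(conn, "U1"))
```

The per-table row counts returned by `erase_user_rows` are worth writing to the audit log: they are the evidence that the deletion actually touched data.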

Crypto-erasure — the practical way to handle deletion in data lakes

Deleting a record from a data warehouse is easy. Deleting it from an immutable data lake (S3/ADLS with versioning) is hard. The practical solution is crypto-erasure: encrypt the user's PII with a user-specific key stored in a key management service. To "delete" the user, delete their encryption key. All their encrypted data becomes permanently unreadable without modifying any files.

# Crypto-erasure pattern
# Each user's PII is encrypted with a unique per-user key
# Deletion = deleting the key from KMS

import boto3
kms = boto3.client('kms', region_name='ap-south-1')

def get_or_create_user_key(user_id: str) -> str:
    """Return the KMS key ID for this user.

    This sketch always creates a new key. In practice, look the user up in a
    key mapping table first and create a key only if none exists.
    """
    # KMS bills per key, so at scale a common variant is a per-user data key
    # wrapped by a single KMS key (envelope encryption).
    response = kms.create_key(
        Description=f'PII encryption key for user {user_id}',
        Tags=[{'TagKey': 'user_id', 'TagValue': user_id}]
    )
    return response['KeyMetadata']['KeyId']

def erase_user(user_id: str, key_id: str):
    """
    GDPR right to erasure via crypto-erasure.
    Schedules key deletion. The AWS KMS minimum waiting period is 7 days.
    After deletion, all PII encrypted with this key is permanently unreadable.
    """
    kms.schedule_key_deletion(
        KeyId=key_id,
        PendingWindowInDays=7  # Minimum allowed by AWS KMS
    )
    print(f"Key for user {user_id} scheduled for deletion. PII will be unreadable in 7 days.")
    # Log this action to your audit trail (log_audit_event is defined in section 8)
    log_audit_event(
        'ERASURE_REQUESTED',
        actor='erasure_job',
        resource=f'kms-key/{key_id}',
        record_id=user_id,
    )

6. India Digital Personal Data Protection Act (DPDP) 2023

India's Digital Personal Data Protection Act was passed in August 2023. It is the first comprehensive personal data protection law in India, replacing a patchwork of older IT Act provisions. The rules (secondary legislation) were expected in 2024–2025 and are being finalized as of March 2026. The core obligations, however, are already clear.

💡 Note
The DPDP Act applies to digital personal data of Indian residents, processed in India or outside India if connected to offering goods/services to Indian residents. If your company has Indian users, this law applies to you.

Key concepts in DPDP for data engineers

| DPDP Term | Plain meaning | Your pipeline implication |
| --- | --- | --- |
| Data Principal | The individual whose data is being processed (your user) | You must be able to identify all data for a given user_id across your systems |
| Data Fiduciary | The company that decides what data to collect and how to use it (your employer) | If your company is designated a Significant Data Fiduciary, it must appoint a Data Protection Officer |
| Consent | Must be free, specific, informed, and unambiguous. No pre-checked boxes. | Tag data with consent purpose. Don't use data beyond consented purpose. |
| Purpose limitation | Data used only for the specific purpose for which consent was given | Same as GDPR: documented business purpose required per field |
| Data erasure | User can request deletion. Company must delete when purpose is fulfilled. | Same deletion capability as GDPR. Deletion when retention period expires, not just on request. |
| Data localisation | Certain "significant" data fiduciaries may be required to store data in India | Watch for storage region requirements. They may affect your cloud region choice |
| Children's data | Parental consent required for users under 18. No behavioural tracking of children. | If your platform has minors, age verification and restricted processing required |
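
Purpose limitation becomes enforceable once consent is stored as data that a pipeline can query. The sketch below assumes a hypothetical consent record shape (`user_id`, `purpose`, `withdrawn_at`) and filters a batch down to the users who consented to the purpose at hand.

```python
def is_use_permitted(consents: list, user_id: str, purpose: str) -> bool:
    """True only if the user has an active (not withdrawn) consent for this exact purpose."""
    return any(
        c["user_id"] == user_id and c["purpose"] == purpose and c["withdrawn_at"] is None
        for c in consents
    )

def filter_for_purpose(rows: list, consents: list, purpose: str) -> list:
    """Drop rows whose user has not consented to this purpose."""
    return [row for row in rows if is_use_permitted(consents, row["user_id"], purpose)]

consents = [
    {"user_id": "U1", "purpose": "order_notifications", "withdrawn_at": None},
    {"user_id": "U1", "purpose": "ml_training", "withdrawn_at": "2026-01-10T09:00:00+00:00"},
    {"user_id": "U2", "purpose": "ml_training", "withdrawn_at": None},
]
orders = [{"user_id": "U1", "amount": 250}, {"user_id": "U2", "amount": 10}]

print(filter_for_purpose(orders, consents, "ml_training"))
# [{'user_id': 'U2', 'amount': 10}]
```

U1 agreed to order notifications and later withdrew consent for model training, so only U2's order reaches the training set.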

GDPR vs DPDP — similarities and differences

| Area | GDPR | India DPDP |
| --- | --- | --- |
| Scope | Personal data of people in the EU | Indian residents' digital data |
| Legal basis | Consent, legitimate interest, contract, legal obligation, vital interest, public task | Consent and "legitimate uses" (state functions, legal obligations, employment, medical emergencies) |
| Right to erasure | Yes: without undue delay, normally within one month | Yes: timeline per rules (expected similar) |
| Right to access | Yes: detailed Subject Access Request | Yes: right to access information about data processed |
| Data breach notification | 72 hours to regulator | Without delay to Data Protection Board (timeline per rules) |
| Fines | Up to €20M or 4% of global annual turnover, whichever is higher | Up to ₹250 crore per instance (rules may revise) |
| DPO requirement | Required for certain organisations | Required for "Significant Data Fiduciaries" (defined by rules) |
| Cross-border transfer | Adequacy decisions or standard clauses | Allowed except to countries notified as restricted |
🎯 Pro Tip
The practical pipeline architecture that satisfies GDPR also satisfies DPDP for most requirements. Build for GDPR-level rigour and you are most of the way to complying with both. The differences are mainly in legal terminology, the age threshold for children's data, and the specific thresholds in secondary legislation.

7. Compliance by Design — A Practical Checklist

Compliance bolted on after a pipeline is live is expensive and incomplete. Compliance built into the pipeline from the start is cheap and reliable. Here is the checklist you run when designing any pipeline that touches personal data.

Before you build
  • What personal data does this pipeline touch? List every field.
  • What is the documented business purpose for each field?
  • Do we have valid consent or another lawful basis (legitimate interest under GDPR, a "legitimate use" under DPDP) for each use?
  • Is there a simpler version of this data that achieves the same goal (minimisation)?
  • Where will data be stored? Which region? Who has access?
  • What is the retention period? How will it be enforced?
When you build
  • TLS enforced on all connections. No plaintext data in transit.
  • Encryption at rest enabled, using CMK if data is sensitive.
  • PII columns tagged in the data catalogue.
  • PII masked or pseudonymised in dev/test environments.
  • Column masking policies applied in the warehouse. Analysts see masked data by default.
  • Access is role-based. No direct grants to individual users.
  • Audit logging enabled: who read what, when.
  • Deletion logic exists and is tested. You can delete all records for user_id = X.
When you go live
  • Retention job scheduled. Old data is deleted automatically after the retention period.
  • Breach response runbook exists. You know who to call and what to do.
  • Data location documented. A regulator can ask "where is this data stored?"
  • "Right to access" query documented. You can export all data for one user on request.

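The "retention job" item is the one most often left as a note in a design doc. A minimal sketch of such a job follows, using SQLite for illustration and assuming timestamps are stored as ISO-8601 UTC strings (which sort correctly as text); the table and column names are hypothetical and must come from trusted configuration, not user input.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def purge_expired(conn, table, timestamp_column, retention_days, now=None):
    """Delete rows older than the retention period. Returns the number of rows removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    with conn:  # commit on success, roll back on error
        cursor = conn.execute(
            f"DELETE FROM {table} WHERE {timestamp_column} < ?", (cutoff,)
        )
    return cursor.rowcount

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, created_at TEXT)")
    conn.execute("INSERT INTO events VALUES (1, '2025-01-01T00:00:00+00:00')")
    conn.execute("INSERT INTO events VALUES (2, '2026-02-20T00:00:00+00:00')")
    removed = purge_expired(conn, "events", "created_at", retention_days=90,
                            now=datetime(2026, 3, 1, tzinfo=timezone.utc))
    print(removed)
```

Run on a schedule (an Airflow DAG, a cron job), and log the returned count so you can show the policy is actually being applied.
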
8. Audit Logging

Audit logs answer "who did what, to which data, and when." They are your proof of compliance, your first tool in a breach investigation, and your defence in a regulatory audit. They are also one of the most commonly skipped pieces of data infrastructure.

# Minimal audit log event — write this to an immutable audit log table
# or a WORM (Write Once Read Many) log bucket

from datetime import datetime, timezone
import json

def log_audit_event(
    action: str,            # READ_PII, DELETE_RECORD, EXPORT_DATA, SCHEMA_CHANGE
    actor: str,             # user_id or service_account_name of who did it
    resource: str,          # table name, pipeline name, file path
    record_id: str = None,  # user_id or record_id affected (if applicable)
    metadata: dict = None,  # any additional context
):
    event = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'action': action,
        'actor': actor,
        'resource': resource,
        'record_id': record_id,
        'metadata': metadata or {},
    }
    # Write to immutable audit log — append only, no updates, no deletes
    # Options: Cloud Storage with object lock, dedicated audit table, CloudTrail, Azure Monitor
    print(json.dumps(event))  # Replace with your log sink

# Examples
log_audit_event('READ_PII', 'analyst_ravi', 'customers', metadata={'purpose': 'support_ticket_123'})
log_audit_event('DELETE_RECORD', 'privacy_admin', 'customers', record_id='U98765', metadata={'reason': 'GDPR_erasure_request'})
log_audit_event('EXPORT_DATA', 'data_engineer', 'orders', record_id='U12345', metadata={'reason': 'DPDP_access_request'})
⚠️ Important
Audit logs must be stored in a separate, append-only location that pipeline engineers cannot modify. If the person who could delete records can also delete the audit trail, your audit log is worthless.
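
Storage-level immutability (object lock, WORM buckets) is the primary control. As a complementary sketch, entries can also be hash-chained so that any later edit or removal is detectable: each entry stores the hash of the previous one. The entry layout and function names below are assumptions for illustration, not a standard format.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry

def _entry_hash(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True)  # stable serialisation
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append_event(log: list, event: dict) -> dict:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    entry = {"event": event, "prev_hash": prev_hash, "hash": _entry_hash(prev_hash, event)}
    log.append(entry)
    return entry

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited, reordered or removed entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True

if __name__ == "__main__":
    log = []
    append_event(log, {"action": "READ_PII", "actor": "analyst_ravi", "resource": "customers"})
    append_event(log, {"action": "DELETE_RECORD", "actor": "privacy_admin", "resource": "customers"})
    print(verify_chain(log))
    log[0]["event"]["actor"] = "someone_else"  # tamper with history
    print(verify_chain(log))
```

Note that truncating the end of the log is not detected by the chain alone; publishing the latest hash to a separate system closes that gap.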

9. Secrets Management — Never Hardcode Credentials

This is one of the most common mistakes made by junior data engineers: database passwords, API keys, and storage account keys hardcoded in Python scripts or committed to Git. A secret in your Git history is a secret that was leaked. Even if you delete the commit later, it may already be in a fork or a scan, so treat it as compromised and rotate it.

| What not to do | What to do instead |
| --- | --- |
| password = "Freshm@rt123!" | password = os.environ["DB_PASSWORD"] |
| connection_string = "Server=...;Password=abc;" | Fetch from Azure Key Vault / AWS Secrets Manager at runtime |
| API key hardcoded in Airflow DAG file | Airflow Variables or Connections (encrypted in Airflow metadata DB) |
| Storage account key in ADF linked service definition | Use Managed Identity, with no key at all. ADF authenticates via Azure AD. |
| Credentials in Docker environment file committed to Git | .env in .gitignore, with values injected via CI/CD secrets or Kubernetes secrets |

# Fetching secrets from AWS Secrets Manager at runtime
import boto3
import json

def get_secret(secret_name: str, region: str = 'ap-south-1') -> dict:
    """Fetch a secret by name. Returns dict of key-value pairs."""
    client = boto3.client('secretsmanager', region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])

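# In your pipeline
from urllib.parse import quote_plus  # passwords may contain @, : or / and must be URL-encoded

db_secret = get_secret('freshmart/prod/postgres')
connection_string = (
    f"postgresql://{quote_plus(db_secret['username'])}:{quote_plus(db_secret['password'])}"
    f"@{db_secret['host']}:{db_secret['port']}/{db_secret['dbname']}?sslmode=require"
)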

# Azure equivalent: use DefaultAzureCredential + Key Vault
# from azure.keyvault.secrets import SecretClient
# from azure.identity import DefaultAzureCredential
# client = SecretClient(vault_url="https://kv-freshmart.vault.azure.net/", credential=DefaultAzureCredential())
# secret = client.get_secret("postgres-password").value
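
A lightweight guard against the mistakes in the table above is to scan source files for obvious secrets before they are committed. The sketch below is deliberately simple: the three patterns are illustrative assumptions and will miss many real secrets, so treat it as a complement to a dedicated secret scanner rather than a replacement.

```python
import re

# Illustrative patterns only; real scanners use far larger rule sets.
SECRET_PATTERNS = {
    "hardcoded password": re.compile(
        r"""(?i)\b(password|passwd|pwd)\s*=\s*["'][^"']+["']"""
    ),
    "password in connection string": re.compile(
        r"""(?i);\s*password\s*=\s*[^;\s"']+"""
    ),
    "AWS access key id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def find_secrets(source: str) -> list:
    """Return (line_number, finding) pairs for lines that look like hardcoded secrets."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, label))
    return findings

if __name__ == "__main__":
    sample = "\n".join([
        'password = "Freshm@rt123!"',
        'password = os.environ["DB_PASSWORD"]',
        'connection_string = "Server=db;Password=abc;"',
    ])
    for line_no, label in find_secrets(sample):
        print(f"line {line_no}: {label}")
```

Wired into a pre-commit hook or CI job, a non-empty result can fail the build before the secret ever reaches the Git history.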

10. What This Looks Like at Work

Day 1 at a fintech (Razorpay / PhonePe / CRED)

Your first task might be: "We have a new DPDP compliance requirement — audit the raw layer and flag every column that contains personal data." You open the data catalogue, run a query across column names and sample values, and produce a spreadsheet with every PII field, its table, and its current protection status. That's a real day-one task, and it requires knowing what PII is.

At a healthcare company (Practo / Apollo)

Health data is a special category under GDPR and is treated as highly sensitive in practice, even though the DPDP Act itself defines no separate "sensitive" category. Your pipeline that ingests patient records into the data warehouse must use customer-managed encryption keys, full audit logging on every SELECT, and row-level security so only the assigned doctor's team can see their patients' records. A senior engineer will review your pipeline and specifically check these controls before approving.

In an interview

"How would you handle a GDPR deletion request in your current architecture?" is a common senior DE interview question. The right answer covers: locating all records for the user_id across all layers, the deletion mechanism (hard delete vs crypto-erasure for the lake), updating aggregates if the user contributed to pre-computed tables, and logging the deletion with a timestamp and actor.

Errors You'll Hit

SSL connection has been closed unexpectedly / SSL SYSCALL error
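Why it happens: The client and server could not establish or keep an SSL session. Typical causes: the database requires SSL (sslmode=require) but the client tried to connect without it, certificate validation failed, or an established connection was dropped (idle timeout, proxy or server restart).
Fix: Add ?sslmode=require to the connection string. If using a self-signed certificate, provide the CA cert path with sslrootcert=/path/to/ca.pem. For dropped connections, add retries or keepalives. Never set sslmode=disable in production.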
botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the GetSecretValue operation
Why it happens: Your pipeline's IAM role does not have secretsmanager:GetSecretValue permission for this secret. Or the resource policy on the secret excludes your role.
Fix: Attach a policy granting secretsmanager:GetSecretValue on the specific secret ARN to your pipeline's execution role. Check both the role policy and the secret's resource-based policy.
cryptography.fernet.InvalidToken
Why it happens: Trying to decrypt data with a different key than the one used to encrypt it. Commonly happens after key rotation if old encrypted data is not re-encrypted before switching keys.
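Fix: During key rotation, decrypt all existing data with the old key and re-encrypt with the new key before retiring the old key. Keep the old key active until migration is complete. The cryptography library's MultiFernet class helps here: it decrypts with any key in its list and encrypts with the first one, so the migration can be gradual.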
OperationalError: SSL error: certificate verify failed
Why it happens: The SSL certificate presented by the server does not match the CA certificate your client is using to verify it. Common when moving between environments (dev uses self-signed, prod uses a proper CA).
Fix: Provide the correct CA certificate bundle. In cloud databases (RDS, Cloud SQL), download the CA cert from the cloud provider's documentation page and reference it in the connection config.
Policy evaluation denied access — explicit deny in a bucket policy (S3)
Why it happens: A bucket policy contains a Deny statement for aws:SecureTransport = false (HTTP requests). Your pipeline or tool is sending an HTTP request to the bucket instead of HTTPS.
Fix: Ensure your S3 client is configured to use HTTPS (the default in modern SDKs). If using a third-party tool, check for an http:// prefix in the bucket endpoint configuration.

🎯 Key Takeaways

  • Encryption in transit (TLS) and at rest are separate problems — you need both.
  • PII is any data that can identify a person, directly or in combination. Tag it, minimise it, mask it in dev, and control access to it in prod.
  • GDPR and India DPDP both require: a lawful basis such as consent, purpose limitation, right to erasure, right to access, and breach notification. Build deletion and access-export capability into every pipeline that touches personal data.
  • Crypto-erasure is the practical solution for GDPR deletion in immutable data lakes — encrypt per user with a unique key, then delete the key.
  • Audit logs must be immutable, append-only, and stored separately from the systems being audited.
  • Never hardcode credentials. Use environment variables in dev, secrets managers in production, and managed identities where the cloud supports them.
  • Compliance is cheapest when built in at design time. Retrofitting a non-compliant pipeline is far harder and usually leaves gaps.