Data Governance — Lineage, Cataloging, Access Control, and Data Mesh
What governance actually means in practice — lineage, cataloging, access control, GDPR, PII handling, and the data mesh organizational pattern.
Data Governance — The Engineering Discipline Behind Trust
Data governance is the system of policies, processes, and technical controls that ensure data in a platform is discoverable, trustworthy, secure, and compliant. The word "governance" sounds administrative — policy documents, steering committees, approval workflows. The reality for a data engineer is much more concrete: implementing lineage tracking so that when a metric is wrong you can trace it back to its source in minutes, building a data catalog so analysts can find and understand tables without asking on Slack, enforcing access control so PII is never exposed to unauthorised users, and handling right-to-erasure requests without breaking the pipeline.
Governance failures are expensive. An analyst uses the wrong table because the catalog has no descriptions. A GDPR audit finds that customer emails are visible to all analysts. A metric is wrong but nobody can trace which transformation introduced the error. A new hire spends three weeks reverse-engineering a data model that nobody documented. These are not abstract risks — they happen at every organisation that treats governance as optional.
Data Lineage — Tracing Every Dataset Back to Its Source
Data lineage is the record of how data moves through a platform — which source systems produced it, which transformations touched it, which downstream datasets depend on it. It answers two questions that come up in every data platform: "where did this data come from?" and "if I change this table, what breaks?"
Column-level lineage — the precise version
TABLE-LEVEL LINEAGE:
silver.orders ← bronze.orders (via dbt stg_orders)
gold.daily_revenue ← silver.orders + silver.stores
COLUMN-LEVEL LINEAGE (more precise):
gold.daily_revenue.net_revenue
← silver.orders.order_amount
← silver.orders.discount_amount
(transformation: order_amount - discount_amount)
gold.daily_revenue.store_region
← silver.stores.region
← silver.stores.state
(transformation: CASE WHEN state IN (...) THEN 'South' ...)
Column-level lineage tells you:
- Exactly which source column populated each Gold column
- Which transformations were applied along the way
- Impact analysis: if orders.order_amount definition changes,
which Gold columns are affected?
- Audit: prove to a regulator which source fields produced a reported metric
IMPLEMENTING LINEAGE WITH OPENLINEAGE:
OpenLineage is an open standard (hosted by the LF AI & Data Foundation) for lineage event emission.
Any tool that implements it emits lineage events that any catalog can consume.
# OpenLineage event structure (emitted by Spark, dbt, Airflow, Flink):
{
"eventType": "COMPLETE",
"eventTime": "2026-03-17T06:14:32.000Z",
"run": {
"runId": "d7c7a7b8-3e1a-4a2c-9b4d-...",
"facets": {
"nominalTime": {"nominalStartTime": "2026-03-17T06:00:00Z"}
}
},
"job": {
"namespace": "freshmart.dbt",
"name": "silver.orders"
},
"inputs": [
{
"namespace": "freshmart.bronze",
"name": "orders",
"facets": {
"schema": {
"fields": [
{"name": "order_id", "type": "INTEGER"},
{"name": "customer_id","type": "INTEGER"},
{"name": "amount", "type": "DECIMAL"}
]
}
}
}
],
"outputs": [
{
"namespace": "freshmart.silver",
"name": "orders",
"facets": {
"columnLineage": {
"fields": {
"order_id": {"inputFields": [{"namespace":"freshmart.bronze","name":"orders","field":"order_id"}]},
"net_revenue":{"inputFields": [
{"namespace":"freshmart.bronze","name":"orders","field":"amount"},
{"namespace":"freshmart.bronze","name":"orders","field":"discount_amount"}
]}
}
}
}
}
]
}
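The event above can also be built and emitted by hand. A minimal sketch, assuming the structure shown (in practice the Spark/Airflow/dbt integrations emit these automatically; the producer URL here is a placeholder, and Marquez accepts OpenLineage events at its /api/v1/lineage endpoint):

```python
import json
import uuid
from datetime import datetime, timezone

def make_run_event(event_type: str, job_namespace: str, job_name: str,
                   inputs: list[dict], outputs: list[dict]) -> dict:
    """Build a minimal OpenLineage RunEvent as a plain dict."""
    return {
        "eventType": event_type,  # START / COMPLETE / FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": inputs,
        "outputs": outputs,
        "producer": "https://example.com/my-pipeline",  # placeholder: identifies the emitter
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
    }

event = make_run_event(
    "COMPLETE", "freshmart.dbt", "silver.orders",
    inputs=[{"namespace": "freshmart.bronze", "name": "orders"}],
    outputs=[{"namespace": "freshmart.silver", "name": "orders"}],
)
# To send to a Marquez backend (assumed running at marquez:5000):
#   requests.post("http://marquez:5000/api/v1/lineage", json=event, timeout=10)
print(json.dumps(event, indent=2)[:120])
```

Facets (schema, columnLineage, nominalTime) attach under the respective `facets` keys exactly as in the event shown above.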
LINEAGE BACKENDS:
Marquez: open source, stores OpenLineage events, REST API, simple UI
DataHub: comprehensive catalog + lineage, ingestion connectors for dbt/Spark/Airflow
Atlan: managed, deep dbt/Airflow integration
OpenMetadata: open source alternative to DataHub
dbt LINEAGE (automatic, no extra setup):
dbt generates a DAG of all models and their dependencies.
dbt docs generate produces a browsable lineage graph:
dbt docs generate && dbt docs serve
Shows: every model, its SQL, its upstream models, its downstream models.
Column-level lineage: dbt-column-lineage package adds column mapping.
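The same downstream lookup that the dbt docs UI performs can be scripted against dbt's target/manifest.json, whose child_map maps every node to its direct dependents. A sketch with a trimmed, hypothetical manifest inlined so it runs standalone:

```python
import json

# Trimmed stand-in for the "child_map" section of dbt's target/manifest.json.
manifest = {
    "child_map": {
        "model.freshmart.silver_orders": [
            "model.freshmart.gold_daily_revenue",
            "model.freshmart.gold_customer_ltv",
        ],
        "model.freshmart.gold_daily_revenue": [],
        "model.freshmart.gold_customer_ltv": [],
    }
}

def downstream(node: str, child_map: dict) -> set:
    """All transitive descendants of a node: the blast radius of a change."""
    seen, stack = set(), [node]
    while stack:
        for child in child_map.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

affected = downstream("model.freshmart.silver_orders", manifest["child_map"])
print(sorted(affected))
# → ['model.freshmart.gold_customer_ltv', 'model.freshmart.gold_daily_revenue']
```

In a real project you would `json.load(open("target/manifest.json"))` instead of the inline dict.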
USE CASE — impact analysis:
"I need to change the definition of customer.tier.
Which Gold models use this column?"
Answer: browse dbt lineage graph → click silver.customers.tier
→ see all downstream models highlighted
→ identify: gold.customer_segments, gold.daily_revenue (via customer tier filter)
→ plan migration accordingly

Lineage for impact analysis — the practical workflow
SCENARIO: You need to change the Silver orders table —
add a new 'tip_amount' column and change how 'net_revenue' is calculated
(previously: order_amount - discount_amount,
new: order_amount - discount_amount + tip_amount)
WITHOUT LINEAGE:
You make the change, run dbt, and discover 4 Gold models break
because they all had column-specific assertions on net_revenue.
You also broke a dashboard that a non-dbt BI tool was using.
Investigation time: 3 hours.
WITH LINEAGE:
Step 1: Query DataHub / dbt graph for all downstream consumers of net_revenue:
$ dbt ls --select silver.orders+   # silver.orders plus everything downstream of it
Output:
silver.orders
gold.daily_revenue ← depends on silver.orders
gold.customer_ltv ← depends on silver.orders
gold.fct_orders_wide ← depends on silver.orders
gold.store_performance ← depends on silver.orders
Step 2: Check which downstream models USE net_revenue specifically:
gold.daily_revenue: SUM(net_revenue) ← affected
gold.customer_ltv: SUM(net_revenue) / ... ← affected
gold.fct_orders_wide: net_revenue column ← affected
gold.store_performance: SUM(net_revenue) ← affected
Step 3: Check for non-dbt consumers (BI tools, APIs, external queries):
DataHub catalog → silver.orders.net_revenue → "Downstream consumers" tab
Shows: Metabase dashboard "Daily Revenue" — uses this column directly
Step 4: Plan migration:
- Add tip_amount as a new column first (non-breaking)
- Update net_revenue in a backward-compatible way
- Notify Metabase dashboard owner of net_revenue change
- Update all 4 Gold models to handle new net_revenue semantics
- Deploy in order: Silver → Gold (same deployment, same dbt run)
- Update Metabase dashboard after validation
Total time with lineage: 30 minutes of planning, zero surprises.
Total time without: 3 hours of debugging broken things.

Data Catalog — Making Data Discoverable
A data catalog is a searchable inventory of all data assets in the platform. Without one, analysts spend hours asking "which table should I use for revenue?" on Slack, use the wrong table, and make decisions on data they don't understand. With one, they search for "revenue", find the canonical Gold table, read its description and owner, check when it was last updated, and start their analysis immediately.
DataHub — the open source enterprise catalog
DataHub is the most widely deployed open source data catalog in 2026.
It has native ingestion connectors for: dbt, Spark, Airflow, Snowflake,
BigQuery, Redshift, Kafka, Looker, Tableau, and 40+ other tools.
DATAHUB INGESTION (recipe configuration):
# datahub_recipes/dbt_freshmart.yml
source:
type: dbt
config:
manifest_path: /path/to/dbt/target/manifest.json
catalog_path: /path/to/dbt/target/catalog.json
sources_path: /path/to/dbt/target/sources.json
target_platform: snowflake
# Ingest: models, columns, descriptions, tags, owners, tests
include_column_lineage: true # column-level lineage from dbt
# Map dbt model owners to DataHub users:
owner_extraction_pattern: "^Team:(?P<owner>.*)$"
sink:
type: datahub-rest
config:
server: "http://datahub-gms:8080"
# datahub_recipes/snowflake_freshmart.yml
source:
type: snowflake
config:
account_id: freshmart.snowflake.com
username: datahub_service_account
database: freshmart_prod
warehouse: analyst_wh
include_table_lineage: true # query log-based lineage
include_column_lineage: true
include_usage_stats: true # who queried what and when
# RUN: datahub ingest -c datahub_recipes/dbt_freshmart.yml
# RUN: datahub ingest -c datahub_recipes/snowflake_freshmart.yml
WHAT DATAHUB PROVIDES AFTER INGESTION:
Search: "net revenue" → finds gold.daily_revenue.net_revenue
Description: "Net revenue after discounts. Source: silver.orders."
Owner: data-team@freshmart.com
Lineage: upstream: silver.orders, downstream: Metabase dashboard
Schema: all columns with types and descriptions
Usage: queried by priya@freshmart.com, 48 times in last 7 days
Quality: last dbt test run: all 12 tests passed, 2026-03-17
Tags: [PII-free, Gold, Finance, SLA-monitored]
Glossary: "Net Revenue" → business definition from Finance team
MAKING dbt DESCRIPTIONS FLOW INTO DATAHUB:
# models/gold/_schema.yml
models:
- name: daily_revenue
description: >
Pre-aggregated daily revenue by store and date.
Updated daily at 06:30 IST. SLA: complete by 08:00 IST.
Source of truth for Finance dashboard.
meta:
owner: data-team@freshmart.com
sla: "08:00 IST daily"
consumers: [finance-dashboard, cfo-report]
columns:
- name: net_revenue
description: >
Total order revenue after discounts.
Calculation: SUM(order_amount - discount_amount)
from silver.orders WHERE status = 'delivered'.
meta:
is_pii: false
business_owner: finance@freshmart.com
# When dbt docs generate runs → DataHub ingestion picks up descriptions
# Descriptions visible in DataHub search and table pages
EVERYDAY GOVERNANCE WORKFLOW FOR A DATA ENGINEER:
Before adding a new Gold table:
1. Create dbt model with complete description in schema.yml
2. Add owner meta field (team email)
3. Add column descriptions for all columns
4. Add relevant tags (PII-free, Gold, team)
5. Add to DataHub business glossary if it defines a new term
After deployment:
DataHub ingestion runs → table discoverable in catalog within 1 hour
Analyst can find it, read the description, understand the grain
Without Slack DMs to the data team.

Business glossary — defining terms once
PROBLEM WITHOUT A GLOSSARY:
Finance: "revenue" = sum of all order amounts including cancelled
Operations: "revenue" = delivered orders only
Product: "revenue" = GMV (gross merchandise value, all statuses)
Three teams, three definitions, three different numbers in the same meeting.
BUSINESS GLOSSARY IN DATAHUB:
Term: "Net Revenue"
Definition: Sum of (order_amount - discount_amount) for orders with
status = 'delivered'. Excludes cancelled and in-progress orders.
Approved by: CFO on 2026-01-15.
Owners: finance@freshmart.com (business), data-team@freshmart.com (technical)
Linked assets: gold.daily_revenue.net_revenue, gold.customer_ltv.net_revenue
Term: "GMV (Gross Merchandise Value)"
Definition: Sum of order_amount for all orders regardless of status.
Used for investor reporting. Does NOT subtract discounts.
Owners: product@freshmart.com
Linked assets: gold.investor_metrics.gmv
Once defined: every table column tagged with "Net Revenue" gets the
canonical definition. Analysts see the definition when they hover the column.
The SAME definition appears in every BI tool, in DataHub, and in dbt docs.
CREATING A GLOSSARY TERM IN DATAHUB (Python API):
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlossaryTermInfoClass

emitter = DatahubRestEmitter("http://datahub-gms:8080")  # same GMS endpoint as the ingestion sink

term_urn = builder.make_glossary_term_urn("NetRevenue")
term_info = GlossaryTermInfoClass(
    definition="Sum of (order_amount - discount_amount) for status='delivered'.",
    termSource="INTERNAL",
    sourceRef="Finance/MetricsDefinitions.pdf",
    sourceUrl="https://docs.freshmart.internal/metrics/net-revenue",
)
emitter.emit_mcp(MetadataChangeProposalWrapper(
    entityUrn=term_urn,
    aspect=term_info,
))

Access Control — Who Can See What and Why
Access control in a data platform operates at multiple levels: storage-level (S3 bucket policies), compute-level (warehouse roles), table-level (GRANT/REVOKE), and column-level (masking policies). Each level serves a different purpose. Getting access control right is not just a compliance requirement — it is operational safety. A data engineer who accidentally has write access to production Gold tables can corrupt them. An analyst with access to raw Bronze PII is one careless export away from a data leak.
ROLE HIERARCHY FOR A DATA PLATFORM (Snowflake example):
ROLE HIERARCHY:
SYSADMIN
└── DATA_PLATFORM_ADMIN ← full platform access (data engineering lead)
├── PIPELINE_ROLE ← transformation pipelines (read bronze, write silver/gold)
├── ANALYST_ROLE ← read silver/gold, no PII, no bronze
├── DATA_SCIENTIST_ROLE ← read bronze/silver/gold, no PII columns
├── BI_SERVICE_ROLE ← read gold only, used by Metabase service account
├── FINANCE_ROLE ← read gold finance schema only
└── OPERATIONS_ROLE ← read gold operations schema only
GRANT STATEMENTS:
-- PIPELINE_ROLE: runs dbt, reads bronze, writes silver and gold
GRANT USAGE ON DATABASE freshmart_prod TO ROLE pipeline_role;
GRANT USAGE ON SCHEMA freshmart_prod.bronze TO ROLE pipeline_role;
GRANT SELECT ON ALL TABLES IN SCHEMA freshmart_prod.bronze TO ROLE pipeline_role;
GRANT USAGE, CREATE TABLE ON SCHEMA freshmart_prod.silver TO ROLE pipeline_role;
GRANT USAGE, CREATE TABLE ON SCHEMA freshmart_prod.gold TO ROLE pipeline_role;
-- ANALYST_ROLE: read silver and gold, no bronze (has raw PII), no write
GRANT USAGE ON DATABASE freshmart_prod TO ROLE analyst_role;
GRANT USAGE ON SCHEMA freshmart_prod.silver TO ROLE analyst_role;
GRANT USAGE ON SCHEMA freshmart_prod.gold TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA freshmart_prod.silver TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA freshmart_prod.gold TO ROLE analyst_role;
-- Explicitly block: no access to bronze (raw PII in landing/bronze)
-- BI_SERVICE_ROLE: Metabase service account, read gold only
GRANT USAGE ON SCHEMA freshmart_prod.gold TO ROLE bi_service_role;
GRANT SELECT ON ALL TABLES IN SCHEMA freshmart_prod.gold TO ROLE bi_service_role;
-- FUTURE GRANTS: apply to tables created after the GRANT statement
GRANT SELECT ON FUTURE TABLES IN SCHEMA freshmart_prod.silver TO ROLE analyst_role;
GRANT SELECT ON FUTURE TABLES IN SCHEMA freshmart_prod.gold TO ROLE analyst_role;
-- Without FUTURE GRANTS: every new Silver/Gold table must be manually granted
COLUMN-LEVEL MASKING (Snowflake Dynamic Data Masking):
-- Create masking policy for email column:
CREATE OR REPLACE MASKING POLICY mask_email_pii
AS (val STRING) RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('DATA_PLATFORM_ADMIN', 'PIPELINE_ROLE')
THEN val -- engineers see raw email
WHEN CURRENT_ROLE() = 'DATA_SCIENTIST_ROLE'
THEN SHA2(val, 256) -- scientists see hash
ELSE '***MASKED***' -- everyone else sees mask
END;
-- Apply masking policy to the column:
ALTER TABLE freshmart_prod.silver.customers
ALTER COLUMN customer_email
SET MASKING POLICY mask_email_pii;
-- Now:
-- Analyst queries silver.customers: customer_email = '***MASKED***'
-- Data engineer queries silver.customers: customer_email = 'priya@example.com'
-- Data scientist queries silver.customers: customer_email = 'sha256hash...'
-- All three query the SAME table — masking is transparent
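For illustration only, the policy's CASE logic can be mirrored as plain Python (role names taken from the policy above; this is just the decision table, not how Snowflake evaluates policies — note that Snowflake's SHA2(val, 256) likewise returns a hex digest):

```python
import hashlib

UNMASKED_ROLES = {"DATA_PLATFORM_ADMIN", "PIPELINE_ROLE"}

def mask_email(value: str, current_role: str) -> str:
    """Mirror of mask_email_pii: raw for engineers, hash for scientists, masked otherwise."""
    if current_role in UNMASKED_ROLES:
        return value  # engineers see the raw email
    if current_role == "DATA_SCIENTIST_ROLE":
        # Stable hash: still joinable across tables, but not readable
        return hashlib.sha256(value.encode()).hexdigest()
    return "***MASKED***"

print(mask_email("priya@example.com", "ANALYST_ROLE"))   # ***MASKED***
print(mask_email("priya@example.com", "PIPELINE_ROLE"))  # priya@example.com
```

The key property is that the masking decision depends only on the querying role, never on the query text, which is why it is transparent to BI tools.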
ROW-LEVEL SECURITY (filter rows by user attributes):
-- Store managers should only see their store's data in gold.store_performance
CREATE OR REPLACE ROW ACCESS POLICY store_data_isolation
AS (store_id VARCHAR) RETURNS BOOLEAN ->
CURRENT_ROLE() IN ('DATA_PLATFORM_ADMIN', 'ANALYST_ROLE') -- full access
OR store_id = (
SELECT assigned_store_id
FROM governance.user_store_assignments
WHERE username = CURRENT_USER()
);
ALTER TABLE freshmart_prod.gold.store_performance
ADD ROW ACCESS POLICY store_data_isolation ON (store_id);
-- Store manager queries gold.store_performance:
-- Automatically filtered to their assigned store only
-- No WHERE clause needed — enforced at engine level

AWS Lake Formation — access control for S3 data lakes
# Lake Formation sits on top of S3 + Glue Catalog.
# Grants table/column/row level permissions on Glue catalog tables.
# Works with: Athena, Redshift Spectrum, EMR, Glue ETL.
import boto3
lf = boto3.client('lakeformation')
# Grant column-level permission (exclude PII columns):
lf.grant_permissions(
Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456:role/AnalystRole'},
Resource={
'TableWithColumns': {
'DatabaseName': 'freshmart_silver',
'Name': 'customers',
# Grant access to all columns EXCEPT these PII columns:
'ColumnWildcard': {
'ExcludedColumnNames': ['customer_email', 'phone_number', 'address']
},
}
},
Permissions=['SELECT'],
)
# Grant table-level read on gold:
lf.grant_permissions(
Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456:role/AnalystRole'},
Resource={
'Table': {
'DatabaseName': 'freshmart_gold',
'Name': 'daily_revenue',
}
},
Permissions=['SELECT'],
)
# DATA FILTER — row-level security via filter expression:
lf.create_data_cells_filter(
TableData={
'TableCatalogId': '123456789',
'DatabaseName': 'freshmart_gold',
'TableName': 'store_performance',
'Name': 'south_india_filter',
# Row filter — only South India stores:
'RowFilter': {
'FilterExpression': "store_region = 'South'"
},
# Column filter — exclude financial metrics:
'ColumnWildcard': {
'ExcludedColumnNames': ['gross_margin_pct', 'operating_cost']
},
}
)
# AUDIT LOG — all access recorded to CloudTrail:
# Every SELECT on a Lake Formation-registered table generates a CloudTrail event.
# Query: which users accessed silver.customers in the last 30 days?
# SELECT userIdentity.principalId, eventTime, requestParameters.tableName
# FROM cloudtrail_logs
# WHERE eventName = 'GetTable'
# AND requestParameters.tableName = 'customers'
#   AND eventTime > CURRENT_DATE - 30

PII Classification, GDPR, and Right-to-Erasure
GDPR (EU) and India's DPDP Act (Digital Personal Data Protection Act, enacted 2023) require that organisations: know where personal data is stored, protect it with appropriate controls, and honour erasure requests within 30 days. For a data platform, this means: classifying which columns contain PII, masking or removing PII at the Silver layer boundary, and implementing a right-to-erasure pipeline that can delete or anonymise a specific customer's data across all layers.
# PII CLASSIFICATION TAXONOMY:
# Level 1 — Direct identifiers (highest sensitivity):
# customer_email, phone_number, national_id (Aadhaar), passport_number
# bank_account_number, credit_card_number
# Level 2 — Indirect identifiers (can identify combined with other data):
# full_name, address, date_of_birth, ip_address, device_id, GPS coordinates
# Level 3 — Quasi-identifiers (alone not identifying, combined risky):
# city, gender, age_group, job_title, company_name
# Level 4 — Non-PII (no restriction):
# order_amount, product_category, store_id, order_status
# AUTOMATED PII DETECTION (using regex + ML heuristics):
import re
from dataclasses import dataclass
@dataclass
class PIIDetectionResult:
column: str
pii_level: int
pii_type: str
confidence: float
PATTERNS = {
    'email':    (1, re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')),
    'phone_in': (1, re.compile(r'(?:\+91|0)?[6-9]\d{9}')),
    'aadhaar':  (1, re.compile(r'\d{4}\s?\d{4}\s?\d{4}')),
    'pan':      (1, re.compile(r'[A-Z]{5}[0-9]{4}[A-Z]')),
    'name':     (2, None),  # requires ML classifier — name detection is hard with regex
    'ip_addr':  (2, re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')),
}
def detect_pii_in_column(sample_values: list[str], column_name: str) -> PIIDetectionResult:
"""Scan a sample of column values for PII patterns."""
hits = {'email': 0, 'phone_in': 0, 'aadhaar': 0, 'pan': 0, 'ip_addr': 0}
total = len(sample_values)
for val in sample_values:
if not isinstance(val, str):
continue
for pii_type, (level, pattern) in PATTERNS.items():
if pattern and pattern.search(val):
hits[pii_type] += 1
for pii_type, count in hits.items():
confidence = count / total if total > 0 else 0
if confidence > 0.3: # > 30% of sampled values match
level = PATTERNS[pii_type][0]
return PIIDetectionResult(
column=column_name, pii_level=level,
pii_type=pii_type, confidence=round(confidence, 2),
)
return PIIDetectionResult(column=column_name, pii_level=4,
pii_type='none', confidence=1.0)
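A quick, self-contained sanity check of the detection patterns (regexes reproduced here with their escapes so the snippet runs on its own; the sample values are made up):

```python
import re

EMAIL = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')
PHONE_IN = re.compile(r'(?:\+91|0)?[6-9]\d{9}')
PAN = re.compile(r'[A-Z]{5}[0-9]{4}[A-Z]')

samples = {
    "priya@example.com": EMAIL,     # email, PII level 1
    "+919876543210": PHONE_IN,      # Indian mobile, PII level 1
    "ABCDE1234F": PAN,              # PAN format, PII level 1
}
for value, pattern in samples.items():
    assert pattern.search(value), f"pattern failed on {value!r}"

# Caveat: a column of 10-digit numbers is not necessarily a phone column; the
# >30% sample threshold in detect_pii_in_column exists to damp such false positives.
print("all patterns matched")
```

Running checks like these in CI against the pattern table catches regressions when new PII types are added.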
# TAGGING PII IN dbt SCHEMA.YML:
# models/bronze/_schema.yml
columns:
- name: customer_email
meta:
pii_level: 1
pii_type: email
masking_required: true
retain_in_layers: [bronze] # raw email only in bronze
transform_for_silver: "SHA2(customer_email, 256)"
- name: phone_number
meta:
pii_level: 1
pii_type: phone
masking_required: true
transform_for_silver: "REGEXP_REPLACE(phone_number, '[0-9]', 'X')"
# dbt macro that reads pii meta and generates masking transformations:
# {{ transform_pii_column('customer_email') }}
# Generates: SHA2(customer_email, 256) AS customer_email_hashed

Right-to-erasure — the GDPR delete pipeline
"""
GDPR Right-to-Erasure (Article 17) pipeline.
Customer submits a deletion request.
Within 30 days: their PII must be anonymised or deleted across all layers.
CHALLENGE: The Medallion Architecture is designed for immutability.
Bronze is append-only. We cannot just DELETE rows.
"""
from datetime import date, datetime
DELETION_SENTINEL = 'GDPR_ERASED'
def process_erasure_request(
customer_id: int,
request_date: date,
conn,
) -> dict:
"""
Anonymise or delete a customer's PII across all layers.
Does NOT delete the row — replaces PII values with a sentinel.
Preserves non-PII columns (order counts, amounts) for statistical use.
"""
results = {}
# ── BRONZE LAYER ──────────────────────────────────────────────────────────
# Bronze is the raw layer. PII must be overwritten here first.
# This is the ONLY time we write UPDATE to Bronze — for legal compliance.
bronze_rows = conn.execute("""
UPDATE bronze.customers
SET customer_email = %s,
phone_number = %s,
full_name = %s,
ip_address = %s,
gdpr_erased = TRUE,
gdpr_erased_at = %s
WHERE customer_id = %s
RETURNING customer_id
""", (DELETION_SENTINEL, DELETION_SENTINEL, DELETION_SENTINEL,
DELETION_SENTINEL, datetime.utcnow(), customer_id)).fetchall()
results['bronze_rows_updated'] = len(bronze_rows)
# ── SILVER LAYER ──────────────────────────────────────────────────────────
# Silver has already-masked columns (email_hashed) but may still have
# quasi-identifiers like full_name or address.
conn.execute("""
UPDATE silver.customers
SET customer_name = %s,
address = %s,
gdpr_erased = TRUE
WHERE customer_id = %s
""", (DELETION_SENTINEL, DELETION_SENTINEL, customer_id))
# Update SCD2 dim_customer — ALL versions:
conn.execute("""
UPDATE gold.dim_customer
SET customer_name = %s
WHERE customer_id = %s
""", (DELETION_SENTINEL, customer_id))
# ── GOLD LAYER ────────────────────────────────────────────────────────────
# Gold aggregates (revenue by store, etc.) do NOT contain PII rows.
# Pre-computed aggregates are fine — customer is anonymous in aggregates.
# No update needed for most Gold tables.
# EXCEPTION: Gold fct_orders_wide has customer city + region at order level
# If these are quasi-identifying in context: anonymise them too.
conn.execute("""
UPDATE gold.fct_orders_wide
SET customer_tier = NULL,
customer_city = NULL,
customer_region = NULL
WHERE customer_id = %s
""", (customer_id,))
# ── RECORD ERASURE ────────────────────────────────────────────────────────
conn.execute("""
INSERT INTO governance.erasure_requests
(customer_id, request_date, completed_at, layers_updated, status)
VALUES (%s, %s, %s, %s, 'completed')
""", (customer_id, request_date, datetime.utcnow(), str(results)))
conn.commit()
# ── DOWNSTREAM SYSTEMS ─────────────────────────────────────────────────────
# ML models trained on this customer's data: log for retraining audit
# Kafka topics: publish erasure event for downstream consumers
# Data Vault: update satellite records for this hub key
# All systems subscribed to the erasure event must handle PII deletion.
return results
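The sentinel-overwrite pattern at the heart of the pipeline is easy to demonstrate end to end. A self-contained sketch using an in-memory SQLite table as a stand-in for bronze.customers (table and column names are illustrative):

```python
import sqlite3

SENTINEL = "GDPR_ERASED"

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bronze_customers (
        customer_id    INTEGER PRIMARY KEY,
        customer_email TEXT,
        order_count    INTEGER,           -- non-PII: survives erasure
        gdpr_erased    INTEGER DEFAULT 0
    )
""")
conn.execute("INSERT INTO bronze_customers VALUES (42, 'priya@example.com', 17, 0)")

# Erase: overwrite PII with the sentinel, keep non-PII columns for statistics.
conn.execute(
    "UPDATE bronze_customers SET customer_email = ?, gdpr_erased = 1 "
    "WHERE customer_id = ?",
    (SENTINEL, 42),
)
row = conn.execute(
    "SELECT customer_email, order_count, gdpr_erased "
    "FROM bronze_customers WHERE customer_id = 42"
).fetchone()
print(row)  # ('GDPR_ERASED', 17, 1)
```

The row survives (so aggregates stay correct) while every PII value is gone, which is exactly why the sentinel approach is preferred over DELETE in an append-only Bronze layer.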
# GDPR COMPLIANCE CHECKLIST FOR A DATA PLATFORM:
# ✓ PII inventory: every column tagged with pii_level in schema.yml
# ✓ Masking at Silver boundary: PII replaced before analyst access
# ✓ Access control: Bronze (raw PII) not accessible to analysts
# ✓ Audit log: all accesses to PII-containing tables logged
# ✓ Erasure pipeline: tested, runs within 30 days of request
# ✓ Data retention: Bronze PII tables have lifecycle policies (max 2 years)
# ✓ Data residency: customer data kept in-region where DPDP Act transfer rules require it
# ✓ Consent log: when consent was given/revoked, stored in governance schema

Data Mesh — When to Decentralise Data Ownership
Data mesh is an organisational and architectural pattern for data platforms at scale. It proposes that data ownership should be decentralised — domain teams (orders, payments, logistics) own and publish their data as "data products," and a central platform team provides the infrastructure and standards. It is a response to the bottleneck that emerges at large organisations when a central data engineering team is the only team that can build data pipelines.
Data mesh is frequently misunderstood as a technology choice. It is primarily an organisational decision. The four principles are: domain ownership, data as a product, self-serve infrastructure, and federated computational governance.
Data mesh vs centralised — the trade-off table
| Dimension | Centralised DE team | Data Mesh |
|---|---|---|
| Org size sweet spot | < 50 engineers, 1-3 domain teams | 200+ engineers, 10+ domain teams |
| Pipeline bottleneck | Central team becomes bottleneck at scale | Domain teams own their pipelines — no central bottleneck |
| Data quality ownership | Central DE team owns quality for all domains | Domain teams own their data product quality |
| Schema changes | Central team coordinates across all consumers | Domain team owns their schema, publishes contract |
| Consistency | High — one team, one standard | Harder — federated teams with varying maturity |
| Implementation complexity | Lower — one team, one repo, one approach | High — governance infrastructure, domain onboarding, contract standards |
| When to choose | Most startups and mid-size companies | When central team is unable to serve all domain needs, and domains have DE capability |
The Governance Tooling Landscape — What Each Tool Does
| Tool | Category | What it does | Best for |
|---|---|---|---|
| DataHub | Catalog + Lineage | Open source. Ingestion connectors for 40+ tools. Search, lineage graph, business glossary, data quality integration. | Enterprise open source catalog, strong dbt+Airflow integration |
| OpenMetadata | Catalog + Lineage | Open source alternative to DataHub. Newer, simpler setup, built-in data quality integration. | Teams that find DataHub too complex to operate |
| Amundsen (Lyft) | Catalog | Open source. Search-focused. Graph database backend. Less active development than DataHub. | Historically popular — DataHub has largely superseded it |
| Alation | Catalog | Enterprise managed. AI-powered search, data stewardship workflows, governance policies. | Large enterprises with budget and compliance needs |
| Atlan | Catalog + Governance | Managed. Deep Slack integration, persona-based views, automated PII tagging, dbt native. | Modern data teams, strong dbt shops |
| OpenLineage | Lineage standard | Open standard (LF AI & Data Foundation) for lineage event emission. Not a tool — a protocol. | The standard to adopt — all tools should emit OpenLineage events |
| Marquez | Lineage backend | Open source OpenLineage event store with REST API and UI. From WeWork. | Simple open source lineage backend for OpenLineage events |
| Unity Catalog | Access + Governance | Databricks native. 3-level namespace, column masking, row security, auto lineage, audit logs. | Databricks lakehouse platforms |
| AWS Lake Formation | Access Control | AWS native. Fine-grained access on Glue catalog tables. Works with Athena, Redshift Spectrum, EMR. | AWS-native data lake platforms |
| Elementary | Data Observability | dbt-native anomaly detection and data observability. Auto-tracks row counts, null rates, distributions. | dbt-based platforms wanting low-friction observability |
| Monte Carlo | Data Observability | Managed data observability. ML-based anomaly detection across warehouse and pipelines. | Enterprise platforms willing to pay for managed observability |
A GDPR Audit Finds PII Exposed to Analysts — The Remediation
FreshMart's Data Protection Officer conducts a GDPR audit. They find that customer emails and phone numbers in silver.customers are directly accessible to 12 analysts, the data engineer who quit last year still has active Snowflake credentials, and there is no audit log of who accessed customer data. You are given two weeks to fix all three.
FINDING 1: PII accessible to all analysts in silver.customers
CURRENT STATE:
GRANT SELECT ON ALL TABLES IN SCHEMA silver TO ROLE analyst_role;
silver.customers has: customer_email, phone_number, full_name, address
FIX — Column masking policy (Snowflake):
-- Step 1: Create masking policies for each PII column type
CREATE MASKING POLICY mask_email AS (v VARCHAR) RETURNS VARCHAR ->
CASE WHEN CURRENT_ROLE() IN ('PIPELINE_ROLE','DATA_PLATFORM_ADMIN')
THEN v ELSE SHA2(v, 256) END;
CREATE MASKING POLICY mask_phone AS (v VARCHAR) RETURNS VARCHAR ->
CASE WHEN CURRENT_ROLE() IN ('PIPELINE_ROLE','DATA_PLATFORM_ADMIN')
THEN v ELSE REGEXP_REPLACE(v, '.', 'X') END;
CREATE MASKING POLICY mask_name AS (v VARCHAR) RETURNS VARCHAR ->
CASE WHEN CURRENT_ROLE() IN ('PIPELINE_ROLE','DATA_PLATFORM_ADMIN')
THEN v ELSE LEFT(v, 1) || '***' END; -- 'P***' (initial only)
-- Step 2: Apply to PII columns
ALTER TABLE silver.customers ALTER COLUMN customer_email
SET MASKING POLICY mask_email;
ALTER TABLE silver.customers ALTER COLUMN phone_number
SET MASKING POLICY mask_phone;
ALTER TABLE silver.customers ALTER COLUMN full_name
SET MASKING POLICY mask_name;
-- Verify: analyst queries silver.customers:
-- customer_email = 'a3f4...sha256hash...'
-- phone_number = 'XXXXXXXXXX'
-- full_name = 'P***'
-- No data model changes needed. No view required. Column masking is transparent.
FINDING 2: Departed employee still has active credentials
FIX — Access review and automated offboarding:
-- Audit: find all enabled users and when they last logged in
SELECT name, last_success_login
FROM snowflake.account_usage.users
WHERE deleted_on IS NULL AND disabled = FALSE
ORDER BY last_success_login DESC NULLS LAST;
-- Found: former_employee@freshmart.com, last login 47 days ago
-- Disable immediately:
ALTER USER "former_employee@freshmart.com" SET DISABLED = TRUE;  -- quotes needed: '@' in identifier
-- Preventive: Automate offboarding via HR system integration
-- When HR marks employee as departed:
-- 1. Disable Snowflake user (API call)
-- 2. Revoke all role assignments
-- 3. Log access review completion
-- 4. Notify security team
FINDING 3: No audit log of customer data access
FIX — Enable Snowflake access history and build audit query:
-- Snowflake retains ACCESS_HISTORY for 365 days in the ACCOUNT_USAGE schema.
-- Always on for Enterprise edition and above — no setup needed.
-- The DPO needs access to query it:
GRANT IMPORTED PRIVILEGES ON DATABASE snowflake TO ROLE dpo_role;
-- (Individual ACCOUNT_USAGE views cannot be granted directly; access is
--  granted via IMPORTED PRIVILEGES on the shared SNOWFLAKE database.)
-- DPO audit query: who accessed customer PII in the last 90 days?
-- (query_text lives in QUERY_HISTORY, so join on query_id)
SELECT
    ah.user_name,
    ah.query_start_time,
    qh.query_text,
    f.value:objectName::STRING AS object_name
FROM snowflake.account_usage.access_history ah
JOIN snowflake.account_usage.query_history qh
  ON ah.query_id = qh.query_id,
LATERAL FLATTEN(input => ah.base_objects_accessed) f
WHERE f.value:objectName::STRING ILIKE '%silver.customers%'
  AND ah.query_start_time >= CURRENT_DATE - 90
ORDER BY ah.query_start_time DESC;
-- Returns: complete log of every access to silver.customers
-- With: who, when, what query
-- Ongoing: schedule weekly DPO report to governance.pii_access_log
-- for permanent audit trail beyond Snowflake's 365-day retention.
OUTCOMES AFTER REMEDIATION:
- PII masked for all analysts: same-day fix via column masking
- Departed employee disabled: same day
- Audit log available to DPO: 1 day (access grant + query setup)
- Full PII inventory across all tables: 1 week (schema.yml audit)
- Automated offboarding process: 2 weeks (HR integration)
- GDPR compliance documentation updated: 2 weeks
🎯 Key Takeaways
- ✓Data governance has four practical pillars: Discoverability (analysts find the right table without asking on Slack), Lineage (every dataset traceable back to its source), Access Control (right people see right data, PII protected), Compliance (GDPR erasure fulfilled, retention enforced, audit logs maintained).
- ✓Column-level lineage is more valuable than table-level lineage. It tells you exactly which source column produced each Gold column. Use it for impact analysis before any schema change: check every downstream consumer of the column you are changing before deploying.
- ✓DataHub is the leading open source data catalog. It ingests metadata from dbt (manifest.json + catalog.json), Snowflake query logs, Airflow, Spark, and 40+ other tools. Requires both manifest.json AND catalog.json for column descriptions. Run dbt docs generate in CI so catalog.json is always current.
- ✓Business glossary centralises business term definitions. "Net Revenue," "GMV," and "Active Customer" each have one canonical definition approved by the business, linked to the column(s) that implement it. Eliminates the meeting where Finance and Product report different revenue numbers.
- ✓Role-based access control: pipeline service account reads Bronze, writes Silver/Gold. Analyst role reads Silver/Gold only, never Bronze (which has raw PII). BI service account reads Gold only. Always use FUTURE GRANTS so new tables automatically inherit the correct permissions without manual grant statements.
- ✓Column masking policies in Snowflake apply masking logic transparently based on the querying role — analysts always query silver.customers but see masked emails. Superior to maintaining separate views per role: one policy, one maintenance point, invisible to query writers. Masking applies to every role, including ACCOUNTADMIN, unless the policy body explicitly exempts a role.
- ✓PII classification levels: Level 1 (direct identifiers — email, phone, Aadhaar), Level 2 (indirect — name, address, IP), Level 3 (quasi-identifiers — city, age), Level 4 (non-PII). Tag every column with its level in dbt schema.yml meta fields. Apply masking at Silver layer for Level 1 and Level 2.
- ✓GDPR right-to-erasure: replace PII values with a sentinel (GDPR_ERASED) rather than deleting rows. Update Bronze → Silver → Gold in that order. This is the only legitimate UPDATE to the append-only Bronze layer — legal compliance supersedes the immutability design principle. Record the erasure in governance.erasure_requests. Publish an erasure event for downstream systems.
- ✓Data mesh is an organisational pattern, not a technology choice. Appropriate when a central DE team is a bottleneck at scale (200+ engineers, 10+ domains). Domain teams own their data products. Central platform team provides infrastructure and standards. Adopting it too early produces inconsistent quality across domains because domain teams lack DE capability.
- ✓Lineage for impact analysis workflow: before changing a model, run dbt ls --select +model_name+ to find all downstream models. Cross-check in DataHub for non-dbt consumers (BI tools, APIs). Update all downstream models together in one deployment. Never change a Gold column definition without knowing all downstream consumers — one unplanned downstream breakage erodes team trust more than any performance issue.