Data Vault 2.0
Hubs, links, and satellites from first principles — hash keys, loading patterns, Business Vault, PIT tables, and when to choose Data Vault over dimensional modelling.
Data Vault 2.0 — Built for Auditability and Multi-Source Integration
Data Vault is a data modelling methodology invented by Dan Linstedt in the early 2000s and formalised as version 2.0 in 2013. It was designed for enterprise integration — loading data from multiple heterogeneous source systems, adapting to frequent source schema changes without remodelling, and producing a fully auditable historical record of every fact ever loaded from every source.
Data Vault is not a replacement for dimensional modelling. It occupies a separate layer: the Raw Vault stores everything that arrived from sources in a highly normalised, INSERT-only form. The Business Vault applies business rules. The Information Mart (dimensional model) serves analytical consumers. Data Vault is the integration layer between raw ingestion and analytical delivery.
Hubs, Links, and Satellites — The Building Blocks
Every table in a Raw Vault is exactly one of three types. Each has a specific purpose, required columns, and strict rules. Violating these rules breaks auditability and parallelism.
DATA VAULT TABLE TYPE 1: HUB
Purpose: Records the existence of a business concept.
Stores the unique business key from source systems.
Rule: ONLY contains the business key and metadata. No descriptive attributes.
Columns: hub_[entity]_hk CHAR(32) PK ← MD5 hash of business key
[entity]_bk VARCHAR ← the business key from source
load_dts TIMESTAMPTZ ← when first loaded (INSERT ONLY)
record_source VARCHAR ← which source system
HUB_CUSTOMER:
hub_customer_hk customer_bk load_dts record_source
MD5('4201938') '4201938' 2026-03-17 06:00 UTC freshmart_orders_db
MD5('USR-42019') 'USR-42019' 2026-03-17 07:00 UTC loyalty_app
Note: these two may be the same real customer — resolved in the Business Vault via a same-as-link (SAL, covered below).
The hub just records that each key was seen from its source.
DATA VAULT TABLE TYPE 2: LINK
Purpose: Records the relationship between two or more entities.
Rule: ONLY hub hash keys + metadata. No descriptive attributes.
Columns: lnk_[rel]_hk CHAR(32) PK ← hash of combined hub HKs
hub_[entity_1]_hk CHAR(32) ← FK to hub 1
hub_[entity_2]_hk CHAR(32) ← FK to hub 2
load_dts TIMESTAMPTZ
record_source VARCHAR
LNK_ORDER_CUSTOMER:
lnk_hk hub_order_hk hub_customer_hk load_dts
MD5(hk1||hk2) MD5('9284751') MD5('4201938') 2026-03-17 ...
DATA VAULT TABLE TYPE 3: SATELLITE
Purpose: Stores descriptive attributes + full history of changes.
Rule: ONLY attributes from ONE source. If two sources describe the same
customer differently: TWO separate satellites, one per source.
Columns: hub_[entity]_hk CHAR(32) ← FK to parent hub (part of PK)
load_dts TIMESTAMPTZ ← when this version loaded (part of PK)
load_end_dts TIMESTAMPTZ ← when superseded (NULL = current)
hash_diff CHAR(32) ← hash of all attributes (change detect)
record_source VARCHAR
[descriptive attributes]
SAT_CUSTOMER_ORDERS_DB (from orders database):
hub_customer_hk load_dts load_end_dts city tier
MD5('4201938') 2024-01-15 06:00 2026-02-01 06:00 Bangalore silver ← expired
MD5('4201938') 2026-02-01 06:00 NULL Hyderabad silver ← current
SAT_CUSTOMER_LOYALTY_APP (from loyalty app — separate satellite):
hub_customer_hk load_dts city loyalty_points
MD5('4201938') 2024-03-01 09:00 Bangalore 4200
TWO satellites for the same customer: both raw versions preserved for audit.
Hash Keys — The Standard That Enables Parallel Loading
Hash keys are deterministic — computed entirely from the business key string with no shared state, no sequence generator, no locks. Any process that has the source business key can independently compute the identical hash. This enables all hub, link, and satellite tables to load in complete parallel from the same source data.
ALGORITHM: MD5 (128-bit) or SHA-256 (256-bit)
Choose one algorithm per vault — never mix within a vault.
Enterprise/regulated: SHA-256. General purpose: MD5.
NORMALISATION (mandatory before hashing):
ALWAYS uppercase and trim: hash_key = MD5(UPPER(TRIM(business_key)))
'ST001', 'st001', ' ST001 ' must all produce the SAME hash.
Different source systems may deliver the same key with different casing.
COMPOSITE KEY (for links — hash of multiple hub HKs):
link_hk = MD5(hub_order_hk || '||' || hub_customer_hk)
Use '||' separator to prevent collision: ('abc','def') vs ('ab','cdef').
Fix the hub ordering once per link — a different order produces a different hash.
NULL HANDLING:
Replace NULL with sentinel: COALESCE(UPPER(TRIM(customer_id)), 'N/A')
Ensures NULL from different sources produces the same hash.
HASH DIFF (for satellites — change detection):
hash_diff = MD5(UPPER(TRIM(city)) || '||' || UPPER(TRIM(tier)) || '||' ...)
If hash_diff changes between loads: insert new satellite row.
If unchanged: skip (no new row needed).
NEVER include pipeline timestamps (ingested_at, load_dts) in hash_diff.
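The skip-or-insert rule can be checked in a few lines of Python (a minimal in-memory sketch; the attribute values follow the satellite example above):

```python
import hashlib

def hash_diff(*attrs):
    # Normalise each attribute, substitute 'N/A' for NULLs, join with '||'
    parts = ['N/A' if a is None else str(a).upper().strip() for a in attrs]
    return hashlib.md5('||'.join(parts).encode('utf-8')).hexdigest()

current = hash_diff('Bangalore', 'silver')    # hash_diff stored on the current row
reloaded = hash_diff('Bangalore', 'silver')   # same attributes arrive again
moved = hash_diff('Hyderabad', 'silver')      # city changed in the source

assert reloaded == current   # unchanged -> skip, no new satellite row
assert moved != current      # changed   -> insert a new version row
```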
PYTHON IMPLEMENTATION:
import hashlib

def hub_hk(business_key: str) -> str:
    # Normalise (uppercase + trim) before hashing
    return hashlib.md5(
        str(business_key).upper().strip().encode('utf-8')
    ).hexdigest()

def link_hk(*hub_hks: str) -> str:
    # '||' separator prevents concatenation collisions
    combined = '||'.join(hk.upper() for hk in hub_hks)
    return hashlib.md5(combined.encode('utf-8')).hexdigest()

def sat_hashdiff(**attrs) -> str:
    parts = []
    for k in sorted(attrs.keys()):  # sort for determinism
        v = str(attrs[k]).upper().strip() if attrs[k] is not None else 'N/A'
        parts.append(v)
    return hashlib.md5('||'.join(parts).encode('utf-8')).hexdigest()
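A quick standalone check confirms the normalisation and separator rules (re-declaring a minimal `hub_hk` here so the snippet runs on its own):

```python
import hashlib

def hub_hk(business_key):
    # Same normalisation as above: uppercase + trim, then MD5
    return hashlib.md5(str(business_key).upper().strip().encode('utf-8')).hexdigest()

# Normalisation: casing/whitespace variants of one key hash identically
assert hub_hk('ST001') == hub_hk('st001') == hub_hk('  ST001  ')

# Why the '||' separator matters: naive concatenation collides
def naive_concat(a, b):
    return hashlib.md5((a + b).encode()).hexdigest()

def safe_concat(a, b):
    return hashlib.md5((a + '||' + b).encode()).hexdigest()

assert naive_concat('abc', 'def') == naive_concat('ab', 'cdef')  # both 'abcdef' — collision!
assert safe_concat('abc', 'def') != safe_concat('ab', 'cdef')    # separator keeps them distinct
```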
WHY PARALLEL LOADING WORKS:
Hub, link, and satellite all compute their hash keys from the SAME source columns.
No table needs to wait for another. No lock, no sequence, no coordination.
Simultaneously (all from stg_orders):
hub_customer: hub_hk = MD5(UPPER(TRIM(customer_id)))
lnk_order_cust: lnk_hk = MD5(hub_order_hk || '||' || hub_customer_hk)
sat_order_details: hub_hk = MD5(UPPER(TRIM(order_id))), hash_diff computed
All three can start and complete independently at the same instant.
Loading Hubs, Links, and Satellites — The Exact Patterns
HUB LOADING RULES:
1. INSERT ONLY — never UPDATE, never DELETE.
2. If hub_hk already exists: skip (INSERT IF NOT EXISTS).
3. load_dts records FIRST SEEN date — never updated.
INSERT INTO raw_vault.hub_customer
(hub_customer_hk, customer_bk, load_dts, record_source)
SELECT DISTINCT
MD5(UPPER(TRIM(customer_id::VARCHAR))),
customer_id::VARCHAR,
CURRENT_TIMESTAMP,
'freshmart_orders_db'
FROM staging.stg_orders
ON CONFLICT (hub_customer_hk) DO NOTHING; -- skip silently if exists ✓
LINK LOADING RULES:
1. INSERT ONLY — never UPDATE, never DELETE.
2. Compute hub hash keys from source data directly (not via join to hub table).
3. ON CONFLICT DO NOTHING if relationship already recorded.
INSERT INTO raw_vault.lnk_order_customer
(lnk_order_customer_hk, hub_order_hk, hub_customer_hk, load_dts, record_source)
SELECT DISTINCT
MD5(MD5(UPPER(TRIM(order_id::VARCHAR))) || '||' || MD5(UPPER(TRIM(customer_id::VARCHAR)))),
MD5(UPPER(TRIM(order_id::VARCHAR))),
MD5(UPPER(TRIM(customer_id::VARCHAR))),
CURRENT_TIMESTAMP,
'freshmart_orders_db'
FROM staging.stg_orders
ON CONFLICT (lnk_order_customer_hk) DO NOTHING;
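The idempotent INSERT-if-not-exists behaviour is easy to exercise end-to-end with SQLite standing in for the warehouse (SQLite spells `ON CONFLICT DO NOTHING` as `INSERT OR IGNORE`; table and column names follow the hub example above):

```python
import hashlib
import sqlite3

def hk(key):
    return hashlib.md5(str(key).upper().strip().encode()).hexdigest()

db = sqlite3.connect(':memory:')
db.execute("""CREATE TABLE hub_customer (
    hub_customer_hk TEXT PRIMARY KEY,
    customer_bk     TEXT,
    load_dts        TEXT,
    record_source   TEXT)""")

def load_hub(customer_id, load_dts, source):
    # INSERT OR IGNORE ~ ON CONFLICT DO NOTHING: first load wins, reruns are no-ops
    db.execute("INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?, ?)",
               (hk(customer_id), str(customer_id), load_dts, source))

load_hub('4201938', '2026-03-17 06:00', 'freshmart_orders_db')
load_hub('4201938', '2026-03-18 06:00', 'freshmart_orders_db')  # rerun: skipped
load_hub('usr-42019', '2026-03-17 07:00', 'loyalty_app')        # new key: inserted

rows = db.execute(
    "SELECT customer_bk, load_dts FROM hub_customer ORDER BY load_dts").fetchall()
# load_dts of the first row still records FIRST SEEN — the rerun did not update it
assert rows == [('4201938', '2026-03-17 06:00'), ('usr-42019', '2026-03-17 07:00')]
```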
EFFECTIVITY SATELLITE (for relationship end dates):
When a relationship ends (employee moves to a new department):
Do NOT delete the link row. Insert into an effectivity satellite:
SAT_EMP_DEPT_EFFECTIVITY:
lnk_hk load_dts load_end_dts is_active
HK_1 2024-01-01 2026-02-28 FALSE ← ended
HK_1 2026-03-01 NULL TRUE ← new assignment
Full history preserved — link never deleted.
SATELLITE LOADING RULES:
1. INSERT ONLY (except updating load_end_dts on the row being expired).
2. New row inserted ONLY when hash_diff changes.
3. Expire previous current row: UPDATE SET load_end_dts = NOW().
4. One satellite per source system — never mix source attributes.
-- Steps 1-2: compute hash_diff in staging and materialise the changed rows.
-- (A temp table rather than a CTE: a CTE is scoped to a single statement,
-- but both the UPDATE and the INSERT below must reference the same set.)
CREATE TEMP TABLE changed AS
WITH staged AS (
    SELECT
        MD5(UPPER(TRIM(customer_id::VARCHAR))) AS hub_customer_hk,
        city, tier, phone_masked,
        MD5(
            UPPER(TRIM(COALESCE(city, 'N/A'))) || '||' ||
            UPPER(TRIM(COALESCE(tier, 'N/A'))) || '||' ||
            UPPER(TRIM(COALESCE(phone_masked, 'N/A')))
        ) AS hash_diff,
        CURRENT_TIMESTAMP AS load_dts,
        'freshmart_orders_db' AS record_source
    FROM staging.stg_customers
)
SELECT s.*
FROM staged s
LEFT JOIN (
    SELECT DISTINCT ON (hub_customer_hk) hub_customer_hk, hash_diff
    FROM raw_vault.sat_customer_orders_db
    WHERE load_end_dts IS NULL
    ORDER BY hub_customer_hk, load_dts DESC
) cur ON s.hub_customer_hk = cur.hub_customer_hk
WHERE cur.hub_customer_hk IS NULL    -- new entity
   OR s.hash_diff != cur.hash_diff;  -- attribute changed

-- Step 3: expire previous current rows
UPDATE raw_vault.sat_customer_orders_db
SET load_end_dts = NOW()
WHERE hub_customer_hk IN (SELECT hub_customer_hk FROM changed)
  AND load_end_dts IS NULL;

-- Step 4: insert new version rows
INSERT INTO raw_vault.sat_customer_orders_db
    (hub_customer_hk, load_dts, load_end_dts, hash_diff, record_source,
     city, tier, phone_masked)
SELECT hub_customer_hk, load_dts, NULL, hash_diff, record_source,
       city, tier, phone_masked
FROM changed;
The Four Layers — Raw Vault, Business Vault, Information Mart
DATA VAULT 2.0 ARCHITECTURE:
SOURCE SYSTEMS
────────────────────────────────────────────────────────────────────────
FreshMart Orders DB Loyalty App Finance System Logistics Partner
▼ Extract → Stage (typed landing, hash keys pre-computed, record_source set)
STAGING AREA (not persistent — rebuilt each load)
────────────────────────────────────────────────────────────────────────
stg_orders_db stg_loyalty_app stg_finance stg_logistics
▼ Parallel load (all tables simultaneously)
RAW VAULT (persistent, immutable, auditable)
────────────────────────────────────────────────────────────────────────
Hubs: HUB_CUSTOMER, HUB_ORDER, HUB_STORE, HUB_PRODUCT
Links: LNK_ORDER_CUSTOMER, LNK_ORDER_STORE, LNK_ORDER_PRODUCT
Satellites: SAT_CUSTOMER_ORDERS_DB, SAT_CUSTOMER_LOYALTY_APP
SAT_ORDER_ORDERS_DB, SAT_STORE_LOGISTICS
Rules: no business rules applied, INSERT ONLY, full audit trail.
THIS IS THE SOURCE OF TRUTH.
▼ Business rules applied
BUSINESS VAULT (derived, still historical)
────────────────────────────────────────────────────────────────────────
Point-in-time (PIT) tables: pre-join snapshots of satellites
Bridge tables: traversal helpers for complex paths
Computed satellites (CSAT): business-derived attributes
Same-as-links (SAL): link two hub records = same real entity
Rules: business rules applied (reconcile conflicting sources),
INSERT-only / history-preserving, not exposed directly to analysts.
▼ Dimensional transform
INFORMATION MART (volatile, consumer-specific)
────────────────────────────────────────────────────────────────────────
Standard dimensional model: fct_orders, dim_customer, dim_store, etc.
Rules: rebuilt from Business Vault / Raw Vault at any time.
NOT the source of truth. The Raw Vault is.
Point-in-Time (PIT) Tables — The Business Vault Query Accelerator
PROBLEM:
To get a customer's current attributes from the Raw Vault you need:
JOIN hub → satellite with complex date range filters → repeat for each satellite
At scale (billions of satellite rows): slow and complex.
PIT TABLE SOLUTION:
Pre-computed snapshot: for each entity at each snapshot date,
records the load_dts of the relevant satellite row at that moment.
PIT_CUSTOMER (daily snapshots):
hub_customer_hk snapshot_dts sat_ord_ldts sat_loy_ldts
MD5('4201938') 2026-03-17 23:59:59 2026-02-01 06:00:00 2024-03-01 09:00:00
MD5('4201938') 2026-01-31 23:59:59 2024-01-15 06:00:00 2024-03-01 09:00:00
QUERY with PIT (equality join — fast):
SELECT sc.city, sc.tier, sl.loyalty_points
FROM pit_customer p
JOIN sat_customer_orders_db sc
ON p.hub_customer_hk = sc.hub_customer_hk
AND p.sat_ord_ldts = sc.load_dts -- equality join via PIT
JOIN sat_customer_loyalty_app sl
ON p.hub_customer_hk = sl.hub_customer_hk
AND p.sat_loy_ldts = sl.load_dts
WHERE p.snapshot_dts = '2026-03-17 23:59:59'
AND p.hub_customer_hk = MD5('4201938');
-- PIT eliminates the expensive date-range satellite scan.
-- Fast indexed equality joins replace slow range queries.
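The PIT build logic itself is just "latest satellite load_dts at or before each snapshot". A minimal in-memory sketch for one customer (satellite timestamps follow the example tables; ISO-8601 strings sort chronologically, so plain string comparison works):

```python
# Satellite load timestamps for one customer, per satellite
sat_orders_loads  = ['2024-01-15 06:00:00', '2026-02-01 06:00:00']
sat_loyalty_loads = ['2024-03-01 09:00:00']

def pit_entry(snapshot_dts, sat_loads):
    # Latest satellite version loaded at or before the snapshot instant
    eligible = [ldts for ldts in sat_loads if ldts <= snapshot_dts]
    return max(eligible) if eligible else None

def pit_row(snapshot_dts):
    # One PIT row: snapshot date + pointer into each satellite
    return (snapshot_dts,
            pit_entry(snapshot_dts, sat_orders_loads),
            pit_entry(snapshot_dts, sat_loyalty_loads))

# After the 2026-02-01 load, snapshots point at the Hyderabad version
assert pit_row('2026-03-17 23:59:59') == (
    '2026-03-17 23:59:59', '2026-02-01 06:00:00', '2024-03-01 09:00:00')
# An earlier snapshot still points at the original Bangalore version
assert pit_row('2026-01-31 23:59:59')[1] == '2024-01-15 06:00:00'
```

These pre-computed pointers are exactly what makes the equality join above possible: the range scan happens once at PIT build time, not per query.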
SAME-AS-LINK (SAL):
Two hub records represent the same real entity:
customer_bk='4201938' (orders_db) = customer_bk='USR-42019' (loyalty_app)
SAL_CUSTOMER records this equivalence:
hub_customer_hk_1 hub_customer_hk_2 confidence load_dts
MD5('4201938') MD5('USR-42019') 0.98 2026-01-01 ...
Business Vault resolves which key to use as "master" for reporting.
Choosing Between Data Vault and Dimensional Modelling
| Dimension | Data Vault 2.0 | Dimensional (Kimball) |
|---|---|---|
| Primary use case | Multi-source enterprise integration, regulated industries | Single/few sources, analytical reporting, BI dashboards |
| Schema change impact | Low — new attribute = new satellite column, nothing else changes | High — may require ETL remodelling and pipeline changes |
| Source system count | Designed for 10+ heterogeneous sources | Best with 1-5 well-understood sources |
| Auditability | Complete — every row has load_dts + record_source, nothing ever deleted | Partial — SCD Type 2 preserves history, no per-row source audit |
| Analyst query complexity | High — Raw Vault not queryable directly; needs information mart layer | Low — star schema queries are simple and predictable |
| Load performance | Excellent — parallel loading, INSERT ONLY, no locks | Good — sequential SCD2 logic, dependency between tables |
| Business rule handling | Separated into Business Vault — raw vault contains no rules | Applied in ETL — mixed with transformation |
| Regulatory compliance | Superior — immutable history, source tracking, zero-trust | Adequate — SCD2 provides history but not full auditability |
| Implementation complexity | High — more table types, hash standards, PIT tables, SALs | Medium — widely understood, standard SQL |
| Best for | Banks, insurance, telecom, government, healthcare | SaaS, e-commerce, startups, analytics-first companies |
Implementing Data Vault in dbt — AutomateDV
# STAGING MODEL (stg_orders.sql) — pre-compute all hash keys first.
# Note: automate_dv.hash(columns, alias, is_hashdiff) takes an output alias,
# not an algorithm — MD5 vs SHA-256 is configured once, project-wide, via the
# 'hash' variable in dbt_project.yml.
{{ config(materialized='view') }}
SELECT
{{ automate_dv.hash('customer_id', 'hub_customer_hk') }},
{{ automate_dv.hash('order_id', 'hub_order_hk') }},
{{ automate_dv.hash(['order_id', 'customer_id'], 'lnk_order_customer_hk') }},
{{ automate_dv.hash(['status', 'amount', 'delivery_fee'], 'order_hashdiff', is_hashdiff=true) }},
order_id, customer_id, status, amount, delivery_fee,
CURRENT_TIMESTAMP AS load_dts,
'freshmart_orders_db' AS record_source
FROM {{ source('staging', 'orders') }}
# HUB MODEL (hub_customer.sql):
{{ config(materialized='incremental', unique_key='hub_customer_hk') }}
{{- automate_dv.hub(
src_pk = 'hub_customer_hk',
src_nk = 'customer_id',
src_ldts = 'load_dts',
src_source = 'record_source',
source_model = 'stg_orders',
) -}}
# LINK MODEL (lnk_order_customer.sql):
{{ config(materialized='incremental', unique_key='lnk_order_customer_hk') }}
{{- automate_dv.link(
src_pk = 'lnk_order_customer_hk',
src_fk = ['hub_order_hk', 'hub_customer_hk'],
src_ldts = 'load_dts',
src_source = 'record_source',
source_model = 'stg_orders',
) -}}
# SATELLITE MODEL (sat_order_details.sql):
{{ config(materialized='incremental', unique_key=['hub_order_hk','load_dts']) }}
{{- automate_dv.sat(
src_pk = 'hub_order_hk',
src_hashdiff = 'order_hashdiff',
src_payload = ['status', 'amount', 'delivery_fee'],
src_ldts = 'load_dts',
src_eff = 'load_end_dts',
src_source = 'record_source',
source_model = 'stg_orders',
) -}}
# dbt project structure:
freshmart_vault/
├── models/
│ ├── staging/ ← hash key pre-computation (views)
│ ├── raw_vault/
│ │ ├── hubs/ ← hub_customer.sql, hub_order.sql ...
│ │ ├── links/ ← lnk_order_customer.sql ...
│ │ └── sats/ ← sat_customer_orders_db.sql ...
│ ├── business_vault/
│ │ ├── pit/ ← pit_customer.sql ...
│ │ └── bridges/
│ └── marts/
│ ├── dims/ ← dim_customer.sql (from PIT + SAT joins)
│ └── facts/ ← fct_orders.sql
When Data Vault Is Overkill — The Honest Assessment
FreshMart Acquires QuickBasket — Why the Vault Handles Integration Cleanly
QuickBasket has customer_id range 1–500,000 (overlapping with FreshMart), different customer attributes (preferred_store), and a different order schema (basket_value not order_amount). Integration must complete in 6 weeks without disrupting existing analytics.
DIMENSIONAL MODEL APPROACH (painful):
1. Re-assign all QuickBasket customer_ids to avoid collision (prefix 'QB-')
2. Add preferred_store to dim_customer (schema change — affects ALL existing queries)
3. Map basket_value → order_amount in ETL (business rule decision)
4. Reload all QuickBasket history into existing fact tables
Duration: 4-6 weeks, high risk to existing analytics.
DATA VAULT APPROACH (additive, zero breaking changes):
Day 1: New staging area for QuickBasket
stg_quickbasket_customers, stg_quickbasket_orders
record_source = 'quickbasket_orders_db'
Key prefix to guarantee uniqueness: MD5('QB_' || UPPER(TRIM(customer_id)))
Day 2: Load QuickBasket customers into existing HUB_CUSTOMER
INSERT INTO hub_customer (hub_customer_hk, customer_bk, load_dts, record_source)
SELECT MD5('QB_' || UPPER(TRIM(customer_id))), 'QB_' || customer_id,
CURRENT_TIMESTAMP, 'quickbasket_orders_db'
FROM stg_quickbasket_customers
ON CONFLICT DO NOTHING;
← Existing FreshMart hub rows: UNTOUCHED
Day 3: New satellite for QuickBasket-specific attributes
SAT_CUSTOMER_QUICKBASKET — new table, never touches SAT_CUSTOMER_ORDERS_DB
Columns: hub_customer_hk, load_dts, preferred_store, loyalty_tier, ...
← EXISTING ANALYTICS UNAFFECTED — existing sats unchanged
Days 4-5: QuickBasket orders into existing links
LNK_ORDER_CUSTOMER accepts QuickBasket orders — record_source distinguishes them.
Weeks 2-4: Same-as-link for identity resolution
Match QuickBasket customers to FreshMart customers via email/phone/name.
SAL_CUSTOMER records matched pairs.
Business Vault resolves "master" identity.
Weeks 5-6: Information Mart rebuilt to include QuickBasket
dim_customer query reads from BOTH satellites (Business Vault resolves conflicts).
fct_orders includes QuickBasket orders.
Existing FreshMart-only metrics: filter by record_source='freshmart_orders_db'.
RESULT: Zero breaking changes to existing analytics.
QuickBasket data added in parallel new satellites.
The Raw Vault INSERT-ONLY design made this structurally safe.
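The key-prefix trick from Day 1 is easy to sanity-check (a standalone sketch of the hashing used in the walkthrough above):

```python
import hashlib

def hk(business_key):
    # Standard normalisation: uppercase + trim, then MD5
    return hashlib.md5(str(business_key).upper().strip().encode()).hexdigest()

freshmart_hk   = hk('4201938')          # existing FreshMart customer
quickbasket_hk = hk('QB_' + '4201938')  # QuickBasket customer with the SAME raw id

# Overlapping source ids land in distinct hub rows — no collision, no re-keying
assert freshmart_hk != quickbasket_hk

# And the QuickBasket key stays deterministic across reruns and processes
assert quickbasket_hk == hk('qb_4201938')
```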
🎯 Key Takeaways
- ✓ Data Vault 2.0 has three table types: Hubs (store business keys — the existence of an entity), Links (store relationships between hubs), and Satellites (store descriptive attributes with full version history per source). Every table in a Raw Vault is exactly one of these three types.
- ✓ Hash keys (MD5 or SHA-256 of UPPER(TRIM(business_key))) are deterministic and computable from source data alone — no shared state, no sequence generator, no locks. This enables all hub, link, and satellite tables to load in complete parallel from the same source data simultaneously.
- ✓ Every satellite row has: hub_hk (FK to parent), load_dts (when loaded), load_end_dts (when expired — NULL if current), hash_diff (hash of all business attributes), and record_source. A new row is inserted only when hash_diff changes. Never include pipeline-generated timestamps in hash_diff.
- ✓ Each source system gets its own satellite. If two sources describe the same customer, there are two separate satellites. This preserves both versions for audit. Business Vault computed satellites (CSAT) apply the agreed source resolution rules for reporting — marts never join raw satellites directly.
- ✓ Point-in-Time (PIT) tables pre-compute which satellite row was current at each snapshot date, transforming expensive date-range satellite scans into fast equality joins. PIT tables are the primary query interface for building Information Marts from the vault.
- ✓ The four-layer architecture: Source → Staging (hash keys pre-computed) → Raw Vault (INSERT-only, no business rules, the source of truth) → Business Vault (PIT tables, SALs, source resolution) → Information Mart (dimensional model for analysts). Information Marts are rebuilt views — the Raw Vault is the source of truth.
- ✓ Adding a new source system changes nothing in the existing Raw Vault. New hub rows are added with a new record_source. New satellites are created for new attributes. Existing satellites, hubs, and links are completely untouched. Existing analytics continue working unchanged.
- ✓ Data Vault beats dimensional modelling for: 10+ heterogeneous sources with conflicting keys, frequent schema changes, and regulatory auditability requirements. Dimensional modelling beats Data Vault for: single sources, stable schemas, analytics-first teams, and when sub-second query performance is critical.
- ✓ AutomateDV (formerly dbtvault) provides dbt macros — hub(), link(), sat(), pit() — that generate correct vault SQL from configuration. The staging model pre-computes all hash keys. All vault models call the macros. Business Vault computed satellites apply source resolution rules.
- ✓ The hybrid enterprise pattern: Data Vault for integration (Raw Vault + Business Vault), dimensional model for delivery (Information Mart). Never choose Data Vault because it sounds more enterprise-grade — the complexity cost is real and must be justified by specific multi-source integration and auditability requirements.