AWS Lake Formation
Lake Formation is the AWS data governance and security layer for your data lake. It controls who can access which databases, tables, columns, and rows in your S3-based data lake — replacing complex IAM and S3 bucket policies with a centralized, fine-grained permission system.
What is AWS Lake Formation?
Before Lake Formation, securing a data lake meant writing complex IAM policies and S3 bucket policies for every user and team. A data analyst needed access to one table in one database — you wrote a policy granting S3 GetObject on a specific prefix, Glue GetTable on that table, and Athena access. Multiply this by dozens of analysts, dozens of tables, and it became unmanageable.
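To make the per-analyst burden concrete, here is a minimal sketch of the kind of IAM policy described above, for one analyst reading one table. The function name, the S3 prefix, and the account ID are hypothetical placeholders, and the Athena statements are left out for brevity.

```python
import json


def analyst_table_policy(bucket, prefix, region, account_id, database, table):
    """Build an IAM policy document for one analyst reading one table."""
    glue_base = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Read the table's data files under one S3 prefix
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
            {   # List only that prefix in the bucket
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {   # Read the table's metadata from the Glue Data Catalog
                "Effect": "Allow",
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
                "Resource": [
                    f"{glue_base}:catalog",
                    f"{glue_base}:database/{database}",
                    f"{glue_base}:table/{database}/{table}",
                ],
            },
        ],
    }


def policies_needed(analyst_count, table_count):
    """Worst case: one hand-written policy per analyst/table pair."""
    return analyst_count * table_count


policy = analyst_table_policy(
    bucket="your-data-lake-bucket",
    prefix="silver/sales/orders/",   # hypothetical prefix for one table
    region="us-east-1",
    account_id="123456789012",       # placeholder account ID
    database="sales_silver",
    table="orders",
)
print(json.dumps(policy, indent=2))
print(policies_needed(24, 36))       # two dozen analysts, three dozen tables
```

Every new analyst or table means another variation of this document, which is the maintenance problem the next paragraph addresses.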
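Lake Formation replaces all of that with a single permission model. You grant access at the database, table, column, or row level through Lake Formation. It does not rewrite your IAM or S3 policies. Instead, when an integrated service such as Athena runs a query, Lake Formation checks those grants and vends temporary credentials for the registered S3 location, so analysts no longer need direct S3 permissions on the data.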
Core Concepts
Data lake administrator: The IAM principal that manages Lake Formation. It can register S3 locations, create databases, and grant permissions to others. Usually a data platform team role.
Databases and tables: Metadata registered in the AWS Glue Data Catalog. Lake Formation secures access to these catalog resources, not to the underlying S3 files directly.
LF-Tags: Tag-based access control. Tag databases, tables, and columns with key-value pairs, then grant permissions based on tags instead of individual resources.
Data filters: Row-level and column-level security. Define which rows and columns a principal can see. Filters apply transparently, so the user sees a restricted view.
Governed tables: Lake Formation managed tables with ACID transactions, automatic compaction, and row-level security built in, similar to Apache Iceberg but AWS-native. AWS has announced end of support for governed tables, so check the current documentation before adopting them for new work.
Cross-account sharing: Share specific tables from your Lake Formation catalog with other AWS accounts without copying data. The data stays in your S3 bucket.
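As an illustration of the last concept, a cross-account share is expressed as an ordinary `grant_permissions` call whose principal is the consumer's AWS account ID. The sketch below only builds the request. The helper name, both account IDs, and the choice of `SELECT` plus `DESCRIBE` are assumptions for illustration, and whether to include grant options depends on how the consumer account will re-share the table.

```python
def cross_account_grant_request(owner_account_id, consumer_account_id,
                                database, table, grantable=()):
    """Build kwargs for lakeformation.grant_permissions sharing one table."""
    return {
        # An external AWS account is identified by its 12-digit account ID
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {
            "Table": {
                "CatalogId": owner_account_id,  # catalog that owns the table
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],
        # Permissions the consumer account may pass on to its own principals
        "PermissionsWithGrantOption": list(grantable),
    }


request = cross_account_grant_request(
    owner_account_id="111122223333",     # placeholder: account holding the data
    consumer_account_id="444455556666",  # placeholder: account receiving access
    database="sales_silver",
    table="orders",
)

# With credentials configured, the request would be sent as:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**request)
print(request)
```

Building the request separately from sending it keeps the grant easy to review or log before it takes effect.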
Setting Up Lake Formation
The setup process: register your S3 bucket as a data lake location, create databases in the Glue Data Catalog, then grant table-level permissions to IAM users and roles. Once the location is registered, integrated services reach the data through Lake Formation instead of through per-user S3 and Glue policies.
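These steps assume the caller is already a data lake administrator. If that still needs to be set up, a minimal sketch might look like the following. The helper name and the role ARN are hypothetical. Because `put_data_lake_settings` replaces the whole settings object, the sketch reads the current settings first and appends to them rather than overwriting the admin list.

```python
def add_data_lake_admin(lf_client, principal_arn):
    """Add one principal to the data lake admins, keeping existing settings."""
    settings = lf_client.get_data_lake_settings()["DataLakeSettings"]
    admins = list(settings.get("DataLakeAdmins", []))
    entry = {"DataLakePrincipalIdentifier": principal_arn}
    if entry not in admins:
        admins.append(entry)
    settings["DataLakeAdmins"] = admins
    # Write back the merged settings so existing admins are not dropped
    lf_client.put_data_lake_settings(DataLakeSettings=settings)
    return settings


# With credentials configured, usage would be:
#   import boto3
#   lf = boto3.client("lakeformation", region_name="us-east-1")
#   add_data_lake_admin(lf, "arn:aws:iam::123456789012:role/DataPlatformAdmin")
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.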
# Register S3 locations and create a data lake using boto3
import boto3

lf = boto3.client('lakeformation', region_name='us-east-1')
glue = boto3.client('glue', region_name='us-east-1')

# Step 1: Register the S3 bucket as a Lake Formation data lake location
lf.register_resource(
    ResourceArn='arn:aws:s3:::your-data-lake-bucket',
    UseServiceLinkedRole=True  # Lake Formation accesses the bucket through its service-linked role
)

# Step 2: Create a database in the Glue Data Catalog (Lake Formation manages access)
glue.create_database(
    DatabaseInput={
        'Name': 'sales_silver',
        'Description': 'Cleaned and validated sales data (Silver layer)',
        'LocationUri': 's3://your-data-lake-bucket/silver/sales/'
    }
)

# Step 3: Grant permissions to a data analyst IAM user
lf.grant_permissions(
    Principal={
        # Placeholder ARN: real AWS account IDs have 12 digits
        'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:user/data-analyst-john'
    },
    Resource={
        'Table': {
            'DatabaseName': 'sales_silver',
            'TableWildcard': {}  # all tables in this database
        }
    },
    Permissions=['SELECT'],          # read-only
    PermissionsWithGrantOption=[]    # cannot grant to others
)

print("Lake Formation setup complete")

Column-Level and Row-Level Security
Lake Formation data filters apply transparently. An analyst running SELECT * on an orders table only sees the columns and rows they are permitted to see — even if they write unrestricted SQL.
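As a rough sketch of how such restrictions could be defined through the API, the helpers below build the request bodies for a column-restricted grant (`TableWithColumns`) and for a row filter (`create_data_cells_filter`) that is then granted as a `DataCellsFilter` resource. The helper names, the filter name `us_west_only`, and the account ID are hypothetical. The table, columns, and role mirror the examples used in this section.

```python
def column_grant_request(principal_arn, database, table, excluded_columns):
    """kwargs for grant_permissions: SELECT on every column except the excluded ones."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


def row_filter_request(catalog_id, database, table, filter_name, row_expression, columns):
    """kwargs for create_data_cells_filter: named rows-and-columns filter on one table."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            "RowFilter": {"FilterExpression": row_expression},
            "ColumnNames": list(columns),
        }
    }


def filter_grant_request(principal_arn, catalog_id, database, table, filter_name):
    """kwargs for grant_permissions: SELECT through an existing data filter."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": catalog_id,
                "DatabaseName": database,
                "TableName": table,
                "Name": filter_name,
            }
        },
        "Permissions": ["SELECT"],
    }


ACCOUNT = "123456789012"  # placeholder account ID
ROLE = f"arn:aws:iam::{ACCOUNT}:role/AnalystRole"
PII = ["customer_email", "customer_phone", "customer_address"]
ALLOWED = ["order_id", "product_id", "region", "revenue", "order_date"]

hide_pii = column_grant_request(ROLE, "sales_silver", "orders", PII)
west_filter = row_filter_request(ACCOUNT, "sales_silver", "orders",
                                 "us_west_only", "region = 'US-WEST'", ALLOWED)
west_grant = filter_grant_request(ROLE, ACCOUNT, "sales_silver", "orders", "us_west_only")

# With credentials configured, these would be applied as:
#   lf = boto3.client("lakeformation")
#   lf.grant_permissions(**hide_pii)            # column-level only
#   lf.create_data_cells_filter(**west_filter)  # define rows + columns
#   lf.grant_permissions(**west_grant)          # grant through the filter
```

The column grant and the data filter are alternatives for the same principal and table: use the first when only columns need hiding, and the second when rows must be restricted as well.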
-- Column-level security with Lake Formation
-- Analysts can query the table but cannot see PII columns

-- Grant SELECT on specific columns only
-- (done via the Lake Formation console or API, not SQL)
-- Principal: arn:aws:iam::123456789012:role/AnalystRole  (placeholder account ID)
-- Resource: sales_silver.orders
-- Columns allowed: order_id, product_id, region, revenue, order_date
-- Columns EXCLUDED: customer_email, customer_phone, customer_address

-- When the analyst runs this query, it works:
SELECT order_id, region, revenue, order_date
FROM sales_silver.orders
WHERE order_date >= '2025-01-01';

-- When the analyst runs this, the query fails:
SELECT customer_email, revenue
FROM sales_silver.orders;
-- The exact error text depends on the query engine; the excluded column
-- is reported as not permitted or not resolvable for this user

-- Row-level security: analysts only see their region
-- Lake Formation data filter: region = 'US-WEST'
-- An analyst on the West team only ever sees West region rows
-- Even SELECT * returns only their permitted rows

LF-Tags — Scale Access Control
Granting permissions per table does not scale. LF-Tags let you tag data assets with metadata (domain, sensitivity) and grant permissions based on those tags. When you add a new table, tag it (or let it inherit the tags of its database) and every grant whose tag expression matches applies automatically.
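To see why tagging alone is enough, it helps to model how a tag expression is matched: every key in the expression must be present on the resource, and the resource's value for that key must be one of the listed values. The following is a small local model of that rule, not a Lake Formation API call. The function name and the table names are hypothetical.

```python
def matches_tag_expression(resource_tags, expression):
    """True if the resource satisfies every condition in the expression.

    resource_tags: dict mapping tag key -> the single value on the resource
    expression:    list of {'TagKey': str, 'TagValues': [str, ...]}
    Keys are combined with AND; the values inside one key with OR.
    """
    return all(
        resource_tags.get(cond["TagKey"]) in cond["TagValues"]
        for cond in expression
    )


# Grant expression: domain is hr AND sensitivity is internal or public
hr_analyst_expression = [
    {"TagKey": "domain", "TagValues": ["hr"]},
    {"TagKey": "sensitivity", "TagValues": ["internal", "public"]},
]

# Hypothetical tables with their effective (own or inherited) tags
tables = {
    "hr_silver.employees": {"domain": "hr", "sensitivity": "internal"},
    "hr_silver.salaries": {"domain": "hr", "sensitivity": "confidential"},
    "sales_silver.orders": {"domain": "sales", "sensitivity": "internal"},
}

visible = [name for name, tags in tables.items()
           if matches_tag_expression(tags, hr_analyst_expression)]
print(visible)  # ['hr_silver.employees']
```

A table tagged `confidential` stays hidden from this grant even though its domain matches, which is why the sensitivity value chosen for a database or table matters as much as the domain tag.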
# LF-Tags (Lake Formation Tag-Based Access Control)
# Tag data assets, then grant permissions based on tags
# Much easier to manage than granting per-table/per-column
import boto3

lf = boto3.client('lakeformation', region_name='us-east-1')

# Create tags
lf.create_lf_tag(TagKey='sensitivity', TagValues=['public', 'internal', 'confidential', 'pii'])
lf.create_lf_tag(TagKey='domain', TagValues=['sales', 'hr', 'finance', 'marketing'])

# Tag a database (its tables inherit these tags unless overridden)
lf.add_lf_tags_to_resource(
    Resource={'Database': {'Name': 'hr_silver'}},
    LFTags=[
        {'TagKey': 'sensitivity', 'TagValues': ['confidential']},
        {'TagKey': 'domain', 'TagValues': ['hr']}
    ]
)

# Grant access based on tags, not individual resources
# HR analysts get SELECT on ALL tables tagged domain=hr AND sensitivity in (internal, public)
# Note: tables that inherit sensitivity=confidential from hr_silver do NOT match this grant
lf.grant_permissions(
    Principal={
        # Placeholder ARN: real AWS account IDs have 12 digits
        'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/HRAnalystRole'
    },
    Resource={
        'LFTagPolicy': {
            'ResourceType': 'TABLE',
            'Expression': [
                {'TagKey': 'domain', 'TagValues': ['hr']},
                {'TagKey': 'sensitivity', 'TagValues': ['internal', 'public']}
            ]
        }
    },
    Permissions=['SELECT']
)

# Adding a new HR table? Tag it domain=hr with a matching sensitivity value
# and the grant applies automatically

How Lake Formation Fits in the AWS Stack
🎯 Key Takeaways
- ✓ Lake Formation replaces complex S3 + IAM + Glue policies with one centralized permission model
- ✓ Permissions are set at database, table, column, and row level, which is much finer than S3 bucket policies
- ✓ LF-Tags enable tag-based access control: grant access to all assets with a tag instead of per table
- ✓ Data filters enforce column and row security transparently, so analysts see only what they are allowed to see
- ✓ Integrated engines such as Athena, Redshift Spectrum, and EMR enforce Lake Formation permissions at query time (some, such as EMR, need the integration enabled first)
- ✓ Cross-account sharing lets you share specific tables with other AWS accounts without copying data