Amazon S3 — Your Data Lake on AWS
S3 is the foundation of every AWS data engineering stack. Every pipeline reads from it, writes to it, or both. Once you understand S3 well, everything else on AWS makes more sense.
What S3 is — and what it is not
S3 is object storage. Not a file system, not a database, not a block device. An object is any file — CSV, Parquet, JSON, image, video — stored with a key that looks like a file path but is not really one.
The key difference from ADLS Gen2 is that S3 does not have a true hierarchical namespace by default. A path like s3://mybucket/bronze/sales/2025-03-01/sales.csv is just a key with slashes in the name — there is no actual folder called bronze. This matters when you do rename operations or list large directories — it can be slower than ADLS.
AWS added S3 Tables in 2024, which brings native Apache Iceberg support to S3 with a built-in catalog. For new projects targeting AWS, S3 Tables is worth understanding.
Bucket structure for a Medallion Architecture
# Option 1: One bucket, prefix-based layers (common for smaller setups)
s3://company-datalake/
├── bronze/
│   └── sales/year=2025/month=03/day=01/sales.csv
├── silver/
│   └── orders/year=2025/month=03/day=01/part-0000.parquet
└── gold/
    ├── daily_sales_summary/
    └── customer_ltv/
# Option 2: One bucket per layer (better access control isolation)
s3://company-bronze/
s3://company-silver/
s3://company-gold/

Option 2 is cleaner for access control — you can give Glue jobs read/write access to bronze and silver, but analysts only get read access to gold. With option 1 you need prefix-level policies, which are more complex.
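To see why option 1 is more complex, here is a sketch of a prefix-level policy giving analysts read-only access to just the gold layer of the single bucket (bucket name from the example layout above):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::company-datalake",
      "Condition": {"StringLike": {"s3:prefix": "gold/*"}}
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::company-datalake/gold/*"
    }
  ]
}
```

With option 2, the same intent is a plain allow on the gold bucket — no `Condition` block needed.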
Reading and writing S3 in Python
import boto3
import pandas as pd
from io import BytesIO
# boto3 is the AWS Python SDK — install with: pip install boto3
s3 = boto3.client('s3')
# Download a file and read it as a DataFrame
response = s3.get_object(Bucket='company-bronze', Key='sales/2025-03-01/sales.csv')
df = pd.read_csv(BytesIO(response['Body'].read()))
print(f"Loaded {len(df)} rows")
# Upload a processed DataFrame to S3
buffer = BytesIO()
df.to_parquet(buffer, index=False)
buffer.seek(0)
s3.put_object(
    Bucket='company-silver',
    Key='orders/2025-03-01/orders.parquet',
    Body=buffer.getvalue()
)
print("Uploaded to silver")
# List files with a specific prefix (like listing a folder)
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='company-bronze', Prefix='sales/2025-03-'):
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['Size'])

Reading S3 from PySpark (in AWS Glue or EMR)
# In a Glue job or EMR notebook, the Spark context is already configured
# You just use the s3:// URI directly
df = spark.read.parquet("s3://company-bronze/sales/year=2025/month=03/")
df_clean = df.dropna(subset=["order_id", "amount"])
df_clean.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("order_date") \
    .save("s3://company-silver/orders/")

IAM — how access control works on AWS
On AWS, access is controlled by IAM (Identity and Access Management). Every service that touches S3 — Glue, Lambda, EMR, Athena — needs an IAM role with explicit permissions.
The key principle is least privilege: give each service only the access it actually needs. A Glue job that reads from bronze and writes to silver should have read-only on the bronze bucket and read-write on silver. Nothing else.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::company-bronze",
        "arn:aws:s3:::company-bronze/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::company-silver",
        "arn:aws:s3:::company-silver/*"
      ]
    }
  ]
}

S3 storage classes — how to cut costs
S3 has several storage classes at different price points. For a data lake, the practical ones are:
S3 Standard — full price, fast access. Use for Silver and Gold that analysts query regularly.
S3 Standard-IA (Infrequent Access) — roughly 40% cheaper than Standard. Use for Bronze data older than 30 days. Retrieval costs are slightly higher, but most Bronze data is never read again after initial processing.
S3 Glacier Instant Retrieval — roughly 68% cheaper than Standard (exact discounts vary by region). Use for Bronze data older than 90 days that you keep for compliance but almost never touch.
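To make the discounts concrete, a quick back-of-envelope for 10 TB of Bronze data, assuming Standard at $0.023/GB-month (us-east-1 list price; exact prices vary by region) and the rough discounts above:

```python
standard_per_gb = 0.023       # assumed Standard list price, $/GB-month
bronze_gb = 10 * 1024         # 10 TB of Bronze data

monthly_standard = bronze_gb * standard_per_gb        # ~$235.52
monthly_ia = monthly_standard * (1 - 0.40)            # ~$141.31
monthly_glacier_ir = monthly_standard * (1 - 0.68)    # ~$75.37

print(f"Standard:    ${monthly_standard:.2f}/month")
print(f"Standard-IA: ${monthly_ia:.2f}/month")
print(f"Glacier IR:  ${monthly_glacier_ir:.2f}/month")
```

For data nobody queries, that is a difference of thousands of dollars per year — which is exactly what lifecycle rules automate.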
Set this up with S3 Lifecycle Rules — the same concept as ADLS lifecycle management. You define rules like "move objects in bronze/ to IA after 30 days, Glacier after 90 days" and AWS handles it automatically.