Amazon S3 for Data Engineers — Beyond Just File Storage
Most data engineers use S3 as a file system — upload files, download files. But S3 has a rich set of features specifically useful for data engineering that most engineers never learn: lifecycle policies, event notifications, S3 Select, Requester Pays, and Intelligent Tiering.
S3 storage classes and lifecycle policies
S3 has multiple storage classes with different cost and retrieval speed tradeoffs:
S3 Standard: hot data, frequent access, highest cost
S3 Standard-IA (Infrequent Access): data accessed monthly, 40% cheaper than Standard
S3 Glacier Instant Retrieval: archive data accessed quarterly, about 68% cheaper than Standard-IA
S3 Glacier Deep Archive: long-term archive, lowest cost, 12-hour retrieval
Lifecycle policies automatically transition objects between classes based on age. Bronze layer data older than 90 days → Standard-IA. Older than 365 days → Glacier. This runs automatically with no code required.
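The tiering rule above can be sketched as a boto3 lifecycle configuration. The bucket name and the bronze/ prefix are illustrative placeholders, not values from this article:

```python
# Sketch of a lifecycle rule matching the tiering above:
# after 90 days -> Standard-IA, after 365 days -> Glacier.
# The "bronze/" prefix and bucket name are placeholders.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-bronze-layer",
            "Status": "Enabled",
            "Filter": {"Prefix": "bronze/"},
            "Transitions": [
                {"Days": 90, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# Applying it requires AWS credentials, e.g.:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake",
#     LifecycleConfiguration=lifecycle_config,
# )
```

Once the configuration is in place, S3 evaluates the rule daily; no pipeline code ever needs to move the objects.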
S3 event notifications for pipeline triggers
S3 can trigger Lambda functions, SQS queues, or SNS topics when objects are created, deleted, or restored.
Common pattern: a source system drops a CSV file in S3 → S3 event triggers a Lambda → Lambda starts a Glue job to process the file → Glue writes Parquet to processed/ prefix.
This creates an event-driven pipeline that runs automatically when new data arrives — no polling, no scheduler, no wasted compute waiting for files.
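A minimal sketch of the Lambda side of that pattern, assuming a hypothetical Glue job named "csv-to-parquet". The event-parsing helper is pure Python; the actual Glue call is shown in comments since it needs AWS credentials:

```python
import urllib.parse

def extract_new_objects(event):
    """Pull (bucket, key) pairs out of an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in event payloads
        # (spaces become '+'), so decode before use.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects

def handler(event, context):
    """Lambda entry point: start one Glue job run per new object."""
    for bucket, key in extract_new_objects(event):
        # With boto3 (job name "csv-to-parquet" is a placeholder):
        # import boto3
        # boto3.client("glue").start_job_run(
        #     JobName="csv-to-parquet",
        #     Arguments={"--source_path": f"s3://{bucket}/{key}"},
        # )
        print(f"would process s3://{bucket}/{key}")
```

Wiring the trigger itself is a one-time bucket notification configuration pointing `s3:ObjectCreated:*` events at the function.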
S3 Select — query without downloading
S3 Select lets you run SQL-like queries against individual S3 objects and return only the matching rows — without downloading the entire file.
For a 10GB CSV file, SELECT * FROM s3object s WHERE s.region = 'US' returns only the US rows, transferring perhaps 500MB instead of 10GB.
Useful for: quick data inspection, lightweight filtering before full processing, and reducing Lambda function memory requirements when processing large files.
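A sketch of issuing that query with boto3. S3 Select returns its result as a stream of events whose byte chunks need not align with row boundaries, so the helper below (pure Python, testable offline) joins all Records payloads before decoding; the API call itself is commented out since it needs AWS credentials, and the bucket/key are placeholders:

```python
def collect_select_payload(event_stream):
    """Concatenate the Records payloads from an S3 Select event stream.

    The stream also carries Stats/Progress/End events, which hold
    no result bytes and are skipped here.
    """
    chunks = [ev["Records"]["Payload"] for ev in event_stream if "Records" in ev]
    return b"".join(chunks)

# The call itself (requires AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# resp = s3.select_object_content(
#     Bucket="my-data-lake",          # placeholder
#     Key="raw/events.csv",           # placeholder
#     ExpressionType="SQL",
#     Expression="SELECT * FROM s3object s WHERE s.region = 'US'",
#     InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
#     OutputSerialization={"CSV": {}},
# )
# rows = collect_select_payload(resp["Payload"]).decode("utf-8")
```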
Partitioning strategy for S3 data lakes
S3 objects are accessed by prefix (folder-like path). Athena, Glue, and Spark all prune partitions based on S3 prefixes.
Optimal partition structure for most data engineering use cases:
s3://bucket/table-name/year=2026/month=03/day=15/
Hive-style partitioning (key=value format) is recognized automatically by Glue catalog, Athena, and Spark. This enables predicate pushdown — only partitions matching your WHERE clause are read.
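Generating those prefixes consistently in ingestion code can be sketched as a small helper; the "events" table name in the usage comment is just an example:

```python
from datetime import date

def partition_prefix(table: str, d: date) -> str:
    """Build a Hive-style (key=value) daily partition prefix.

    Zero-padding month and day keeps prefixes lexicographically
    sortable, which Athena, Glue, and Spark all handle cleanly.
    """
    return f"{table}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

# e.g. partition_prefix("events", date(2026, 3, 15))
#   -> "events/year=2026/month=03/day=15/"
```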