Amazon S3 for Data Engineers — Beyond Just File Storage
Most data engineers use S3 as a file system — upload files, download files. But S3 has a rich set of features specifically useful for data engineering that most engineers never learn: lifecycle policies, event notifications, S3 Select, Requester Pays, and Intelligent Tiering.
S3 storage classes and lifecycle policies
S3 has multiple storage classes with different cost and retrieval speed tradeoffs:
S3 Standard: hot data, frequent access, highest cost
S3 Standard-IA (Infrequent Access): data accessed monthly, 40% cheaper than Standard
S3 Glacier Instant Retrieval: archive data accessed quarterly, about 68% cheaper than Standard-IA
S3 Glacier Deep Archive: long-term archive, lowest cost, 12-hour retrieval
Lifecycle policies automatically transition objects between classes based on age. Bronze layer data older than 90 days → Standard-IA. Older than 365 days → Glacier. This runs automatically with no code required.
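The tiering rule above can be sketched as a boto3 lifecycle configuration. The bucket name and the bronze/ prefix are illustrative placeholders, not values from this article:

```python
# Sketch of a lifecycle rule matching the tiering above:
# after 90 days -> Standard-IA, after 365 days -> Glacier.
# The "bronze/" prefix and bucket name are placeholders.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-bronze-layer",
            "Status": "Enabled",
            "Filter": {"Prefix": "bronze/"},
            "Transitions": [
                {"Days": 90, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# Applying it requires AWS credentials, e.g.:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake",
#     LifecycleConfiguration=lifecycle_config,
# )
```

Once the configuration is in place, S3 evaluates the rule daily; no pipeline code ever needs to move the objects.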
S3 event notifications for pipeline triggers
S3 can trigger Lambda functions, SQS queues, or SNS topics when objects are created, deleted, or restored.
Common pattern: a source system drops a CSV file in S3 → S3 event triggers a Lambda → Lambda starts a Glue job to process the file → Glue writes Parquet to processed/ prefix.
This creates an event-driven pipeline that runs automatically when new data arrives — no polling, no scheduler, no wasted compute waiting for files.
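A minimal sketch of the Lambda side of that pattern, assuming a hypothetical Glue job named "csv-to-parquet". The event-parsing helper is pure Python; the actual Glue call is shown in comments since it needs AWS credentials:

```python
import urllib.parse

def extract_new_objects(event):
    """Pull (bucket, key) pairs out of an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in event payloads
        # (spaces become '+'), so decode before use.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects

def handler(event, context):
    """Lambda entry point: start one Glue job run per new object."""
    for bucket, key in extract_new_objects(event):
        # With boto3 (job name "csv-to-parquet" is a placeholder):
        # import boto3
        # boto3.client("glue").start_job_run(
        #     JobName="csv-to-parquet",
        #     Arguments={"--source_path": f"s3://{bucket}/{key}"},
        # )
        print(f"would process s3://{bucket}/{key}")
```

Wiring the trigger itself is a one-time bucket notification configuration pointing `s3:ObjectCreated:*` events at the function.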
S3 Select — query without downloading
S3 Select lets you run SQL-like queries against individual S3 objects and return only the matching rows — without downloading the entire file.
For a 10GB CSV file, SELECT * FROM s3object s WHERE s.region = 'US' returns only the US rows, transferring perhaps 500MB instead of 10GB.
Useful for: quick data inspection, lightweight filtering before full processing, and reducing Lambda function memory requirements when processing large files.
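A sketch of issuing that query with boto3. S3 Select returns its result as a stream of events whose byte chunks need not align with row boundaries, so the helper below (pure Python, testable offline) joins all Records payloads before decoding; the API call itself is commented out since it needs AWS credentials, and the bucket/key are placeholders:

```python
def collect_select_payload(event_stream):
    """Concatenate the Records payloads from an S3 Select event stream.

    The stream also carries Stats/Progress/End events, which hold
    no result bytes and are skipped here.
    """
    chunks = [ev["Records"]["Payload"] for ev in event_stream if "Records" in ev]
    return b"".join(chunks)

# The call itself (requires AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# resp = s3.select_object_content(
#     Bucket="my-data-lake",          # placeholder
#     Key="raw/events.csv",           # placeholder
#     ExpressionType="SQL",
#     Expression="SELECT * FROM s3object s WHERE s.region = 'US'",
#     InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
#     OutputSerialization={"CSV": {}},
# )
# rows = collect_select_payload(resp["Payload"]).decode("utf-8")
```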
Partitioning strategy for S3 data lakes
S3 objects are accessed by prefix (folder-like path). Athena, Glue, and Spark all prune partitions based on S3 prefixes.
Optimal partition structure for most data engineering use cases:
s3://bucket/table-name/year=2026/month=03/day=15/
Hive-style partitioning (key=value format) is recognized automatically by Glue catalog, Athena, and Spark. This enables predicate pushdown — only partitions matching your WHERE clause are read.
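Generating those prefixes consistently in ingestion code can be sketched as a small helper; the "events" table name in the usage comment is just an example:

```python
from datetime import date

def partition_prefix(table: str, d: date) -> str:
    """Build a Hive-style (key=value) daily partition prefix.

    Zero-padding month and day keeps prefixes lexicographically
    sortable, which Athena, Glue, and Spark all handle cleanly.
    """
    return f"{table}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

# e.g. partition_prefix("events", date(2026, 3, 15))
#   -> "events/year=2026/month=03/day=15/"
```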