Azure Storage

ADLS Gen2 Best Practices — How to Structure Your Azure Data Lake

March 14, 2026 · 6 min read · by Asil

Most Azure data engineers set up ADLS Gen2 incorrectly at the start. The mistakes made early — poor container structure, wrong partitioning strategy, incorrect access controls — are expensive to fix later when pipelines are running at scale.

Container structure

Use one container per Medallion layer, not one container per project:

bronze/
silver/
gold/

Inside each container, organize by domain then by source system then by date:

bronze/sales/orders/2026/03/15/
bronze/hr/employees/2026/03/15/

This structure makes lifecycle management policies easy — you can expire bronze data older than 90 days at the container prefix level without affecting silver or gold.
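A small sketch of the convention above: a helper that builds the bronze landing path from domain, source system, and run date. The function name is my own, not part of any Azure SDK.

```python
from datetime import date

def bronze_path(domain: str, source: str, run_date: date) -> str:
    """Build a bronze landing path: domain -> source system -> date."""
    return f"bronze/{domain}/{source}/{run_date:%Y/%m/%d}/"

print(bronze_path("sales", "orders", date(2026, 3, 15)))
# bronze/sales/orders/2026/03/15/
```

Because every path under `bronze/sales/` shares that prefix, a lifecycle rule scoped to the prefix can expire old sales data without touching the other domains.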

Partitioning strategy

Always partition by date. For most batch pipelines: year/month/day partitioning.

bronze/sales/orders/year=2026/month=03/day=15/

Date-based partition pruning is critical for performance: a query for the last week reads 7 partitions instead of the entire dataset.
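To make the pruning concrete, here is an illustrative helper (not an Azure API) that enumerates the Hive-style `year=/month=/day=` partition paths a trailing-window query would touch:

```python
from datetime import date, timedelta

def partitions_for_range(table_root: str, end: date, days: int) -> list[str]:
    """Hive-style partition paths for a trailing window ending on `end`."""
    window = (end - timedelta(days=i) for i in range(days - 1, -1, -1))
    return [f"{table_root}/year={d:%Y}/month={d:%m}/day={d:%d}/" for d in window]

paths = partitions_for_range("bronze/sales/orders", date(2026, 3, 15), 7)
print(len(paths))   # 7
print(paths[-1])    # bronze/sales/orders/year=2026/month=03/day=15/
```

A query engine that understands the partition scheme scans only these 7 prefixes, no matter how many years of history sit alongside them.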

For high-volume tables: also partition by a secondary key like region or source_system. But do not over-partition — partitions smaller than 128MB create the small files problem.
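A quick back-of-the-envelope check before adding a secondary partition key, sketched as a function (the threshold follows the 128 MB rule of thumb above; the function itself is illustrative):

```python
SMALL_FILE_THRESHOLD = 128 * 1024**2  # 128 MB, per the rule of thumb above

def secondary_key_ok(daily_bytes: int, key_cardinality: int) -> bool:
    """A secondary key splits each daily partition by its cardinality;
    reject the key if the resulting partitions fall under 128 MB."""
    return daily_bytes // key_cardinality >= SMALL_FILE_THRESHOLD

print(secondary_key_ok(10 * 1024**3, 20))  # True:  ~512 MB per partition
print(secondary_key_ok(1 * 1024**3, 50))   # False: ~20 MB per partition
```

If the check fails, keep the table partitioned by date only and rely on file-level statistics for the secondary filter.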

Access control

Use Azure RBAC for container-level access and ACLs for folder-level access. Never use storage account keys in application code — use Managed Identity or Service Principals.

In practice: Databricks clusters get a Service Principal with Storage Blob Data Contributor on the storage account. ADF pipelines use Managed Identity. No passwords or keys in code or configuration.

Set up separate Service Principals for bronze writes, silver reads/writes, and gold reads. Principle of least privilege prevents a silver transformation bug from accidentally overwriting bronze.
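The least-privilege split can be captured as a simple role map. The role names below are the real Azure built-in roles; the Service Principal names and the checker function are hypothetical, just to show the intent:

```python
# Illustrative mapping: which built-in Azure role each (hypothetical)
# Service Principal holds on each container.
SP_ROLES = {
    "sp-bronze-ingest":    {"bronze": "Storage Blob Data Contributor"},
    "sp-silver-transform": {"bronze": "Storage Blob Data Reader",
                            "silver": "Storage Blob Data Contributor"},
    "sp-gold-serve":       {"gold": "Storage Blob Data Reader"},
}

def can_write(principal: str, container: str) -> bool:
    """Contributor grants write; Reader (or no role) does not."""
    return SP_ROLES.get(principal, {}).get(container) == "Storage Blob Data Contributor"

print(can_write("sp-silver-transform", "bronze"))  # False
print(can_write("sp-silver-transform", "silver"))  # True
```

Under this map, the silver transformation job can read bronze but physically cannot overwrite it, which is exactly the failure mode least privilege is meant to prevent.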

Small files problem and how to fix it

Streaming pipelines and frequent micro-batch runs create thousands of tiny Parquet files. Querying 10,000 files of 1MB each is far slower than querying 10 files of 1GB each.

Fix with Delta Lake OPTIMIZE: run OPTIMIZE against each table on a schedule (daily is usually enough). Each run compacts the small files into larger, evenly-sized ones.
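The arithmetic behind compaction, as a small sketch (the function and the 1 GB target are illustrative, not Delta defaults):

```python
def compacted_file_count(total_bytes: int, target_file_bytes: int = 1024**3) -> int:
    """Number of files after compacting to a target size (ceiling division)."""
    return -(-total_bytes // target_file_bytes)

small_files = 10_000                 # 10,000 files of ~1 MB each
total = small_files * 1024**2        # ~10 GB of data in total
print(compacted_file_count(total))   # 10
```

Same data, three orders of magnitude fewer file-open operations per scan, which is where the query-time win comes from.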

For Synapse external tables pointing at ADLS: use CETAS (Create External Table As Select) to produce clean, evenly-sized Parquet files.
