
Amazon Athena

Athena is AWS's serverless SQL query engine. It queries data directly in S3 using standard SQL: no database to set up, no cluster to manage, no data to load. You pay only for the data each query scans.

13 min read · March 2026

What is Amazon Athena?

Athena lets you run SQL queries directly against files in S3 — CSV, JSON, Parquet, ORC, Avro. There is no database to set up, no ETL to load data, no cluster to manage. You point Athena at an S3 path, define a schema, and start querying.

The pricing model is what makes Athena unique: you pay $5 per terabyte of data scanned. Use Parquet with partition pruning and most queries cost fractions of a cent. SELECT * on a 10TB CSV table costs $50 per run. Format and partitioning decisions have direct financial consequences.
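At $5 per terabyte, scan cost is simple arithmetic. A minimal sketch (the helper name is illustrative; the 10 TB figure matches the example above):

```python
# Athena bills per byte scanned; the list price is $5 per TB.
PRICE_PER_TB = 5.00

def scan_cost(gb_scanned: float) -> float:
    """Cost in USD for a query that scans `gb_scanned` gigabytes."""
    return round(gb_scanned / 1024 * PRICE_PER_TB, 4)

# Full scan of a 10 TB CSV table
print(scan_cost(10 * 1024))   # 50.0
# Pruned Parquet query touching ~1 GB
print(scan_cost(1))           # 0.0049
```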

Athena vs Redshift
Athena is for ad-hoc queries on S3 data — no loading, no maintenance, pay per query. Redshift is for sustained, high-concurrency analytical workloads with consistent performance. Use Athena for exploration and infrequent queries. Use Redshift for dashboards and reports that run constantly.

Core Concepts

Serverless

No cluster to provision. Athena spins up compute on demand and tears it down after your query. You pay per query, not per hour.

Pay per scan

$5 per terabyte of data scanned. Partition pruning and Parquet format reduce scanned data by 90%+ — this directly reduces your bill.

Glue Catalog

Athena uses the AWS Glue Data Catalog as its schema registry. Tables defined in Glue are immediately queryable in Athena.
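What Athena reads from Glue at query time is just a table definition. A sketch of the `TableInput` payload you would pass to boto3's `glue.create_table` (bucket, columns, and names here are illustrative, not from the original):

```python
# Shape of a Glue Data Catalog table definition for a Parquet table.
# Pass this as TableInput to boto3's glue.create_table; Athena can
# then query the table immediately.
table_input = {
    "Name": "sales_silver",
    "TableType": "EXTERNAL_TABLE",
    "PartitionKeys": [
        {"Name": "year", "Type": "int"},
        {"Name": "month", "Type": "int"},
    ],
    "StorageDescriptor": {
        "Columns": [
            {"Name": "order_id", "Type": "string"},
            {"Name": "revenue", "Type": "double"},
        ],
        "Location": "s3://your-bucket/silver/sales/",
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
}
```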

Iceberg support

Native Apache Iceberg support for ACID transactions, MERGE statements, and time travel — turning S3 into a proper transactional data lake.

Federated Query

Query data in RDS, Redshift, DynamoDB, and on-prem databases from Athena SQL — no data movement required.

Workgroups

Isolate teams and set query cost limits. A workgroup can limit individual queries to scan at most 1GB — protecting against accidental expensive queries.
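A 1 GB per-query cap can be expressed as a workgroup configuration. A hedged sketch of the `Configuration` payload for boto3's `athena.create_work_group` (bucket and workgroup names are placeholders):

```python
# Workgroup configuration capping each query at 1 GB scanned.
GB = 1024 ** 3

workgroup_config = {
    "ResultConfiguration": {
        "OutputLocation": "s3://your-results-bucket/athena-results/"
    },
    "EnforceWorkGroupConfiguration": True,   # users cannot override the cap
    "BytesScannedCutoffPerQuery": 1 * GB,    # queries exceeding this are cancelled
}

# athena = boto3.client("athena")
# athena.create_work_group(Name="analytics-team", Configuration=workgroup_config)
```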

Creating Tables and Querying S3

Athena tables are metadata only: they define the schema and S3 location, and the data stays in S3. Run MSCK REPAIR TABLE after adding new partitions so Athena can find them, or register them individually with ALTER TABLE ADD PARTITION.
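MSCK REPAIR TABLE only discovers partitions laid out in the Hive `key=value` directory convention. A small sketch of the S3 layout it expects (the helper function is illustrative; the bucket matches the example table below):

```python
# Hive-style partition paths that MSCK REPAIR TABLE can discover.
# Directory names must be key=value, in PARTITIONED BY order.
def partition_prefix(base: str, year: int, month: int, day: int) -> str:
    return f"{base}/year={year}/month={month}/day={day}/"

print(partition_prefix("s3://your-bucket/silver/sales", 2025, 3, 15))
# s3://your-bucket/silver/sales/year=2025/month=3/day=15/
```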

create_athena_table.sql
sql
-- Create an external table pointing at S3 Parquet files
-- Athena never moves the data — it queries S3 directly
-- You only pay for bytes scanned

CREATE EXTERNAL TABLE sales_silver (
    order_id       STRING,
    customer_id    STRING,
    product_id     STRING,
    region         STRING,
    revenue        DOUBLE,
    order_date     DATE,
    load_ts        TIMESTAMP
)
PARTITIONED BY (year INT, month INT, day INT)   -- partition pruning
STORED AS PARQUET                               -- columnar format = less data scanned
LOCATION 's3://your-bucket/silver/sales/'
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');

-- After creating the table, load partition metadata
-- Without this, Athena cannot find the partitions
MSCK REPAIR TABLE sales_silver;

-- Now query it — only partitions matching the WHERE clause are scanned
SELECT
    region,
    SUM(revenue)    AS total_revenue,
    COUNT(order_id) AS total_orders
FROM sales_silver
WHERE year = 2025 AND month = 3
GROUP BY region
ORDER BY total_revenue DESC;

Apache Iceberg Tables in Athena

Plain Parquet tables on S3 support inserts but not in-place changes: you cannot update or delete rows. Iceberg tables add ACID transactions, MERGE statements, and time travel. AWS has made Iceberg a first-class citizen in Athena.

athena_iceberg.sql
sql
-- Athena natively supports Apache Iceberg tables
-- Iceberg adds ACID transactions, upserts, and time travel on S3

-- Create an Iceberg table (no external keyword — Athena manages metadata)
CREATE TABLE sales_iceberg (
    order_id    STRING,
    customer_id STRING,
    revenue     DOUBLE,
    order_date  DATE,
    region      STRING
)
PARTITIONED BY (region)
LOCATION 's3://your-bucket/iceberg/sales/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

-- MERGE — upsert support (not available on plain Parquet tables)
MERGE INTO sales_iceberg AS target
USING new_sales AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN
    UPDATE SET revenue = source.revenue, region = source.region
WHEN NOT MATCHED THEN
    INSERT (order_id, customer_id, revenue, order_date, region)
    VALUES (source.order_id, source.customer_id, source.revenue, source.order_date, source.region);

-- Time travel — query data as it was yesterday
SELECT * FROM sales_iceberg
FOR TIMESTAMP AS OF (CURRENT_TIMESTAMP - INTERVAL '1' DAY);
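
The MERGE above is an upsert: update rows whose key matches, insert the rest. In plain Python terms (illustrative data, not Athena's execution model):

```python
# What MERGE INTO ... ON order_id does, modeled as dicts keyed by order_id.
target = {"o1": {"revenue": 100.0}, "o2": {"revenue": 50.0}}
source = {"o2": {"revenue": 75.0}, "o3": {"revenue": 20.0}}

for order_id, row in source.items():
    if order_id in target:
        target[order_id].update(row)   # WHEN MATCHED THEN UPDATE
    else:
        target[order_id] = row         # WHEN NOT MATCHED THEN INSERT

print(target)
# {'o1': {'revenue': 100.0}, 'o2': {'revenue': 75.0}, 'o3': {'revenue': 20.0}}
```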

Cost Optimization

The biggest Athena cost lever is data scanned per query. These three practices reduce costs by 90% or more.

athena_optimization.sql
sql
-- Cost optimization: Athena charges $5 per TB scanned
-- These practices reduce cost by 90%+

-- BAD: scans all columns in all partitions
SELECT * FROM sales_silver;

-- GOOD: partition filter + column selection
-- Only reads March 2025 data, only 3 columns
SELECT order_date, region, revenue
FROM sales_silver
WHERE year = 2025 AND month = 3;

-- Convert CSV to Parquet (10x less data scanned)
-- Run once — all future queries on the Parquet table are 10x cheaper
CREATE TABLE sales_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    partitioned_by = ARRAY['year', 'month', 'day'],
    external_location = 's3://your-bucket/silver/sales_parquet/'
)
AS SELECT *, year(order_date) AS year, month(order_date) AS month, day(order_date) AS day
FROM sales_csv_raw;

-- Check query cost before running — use EXPLAIN
EXPLAIN SELECT region, SUM(revenue) FROM sales_silver WHERE year = 2025;
Always convert CSV to Parquet
A 100GB CSV table costs $0.50 per full scan. The same data as Parquet with partitioning costs $0.005 for a typical filtered query — 100x cheaper. Converting to Parquet is a one-time cost that pays back immediately.

Running Athena Queries from Python

athena_boto3.py
python
# Run Athena queries from Python — boto3
# pip install boto3

import boto3
import time
import pandas as pd

athena = boto3.client('athena', region_name='us-east-1')

def run_query(sql: str, database: str, output_bucket: str) -> pd.DataFrame:
    # Start query execution
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={
            'OutputLocation': f's3://{output_bucket}/athena-results/'
        }
    )
    query_id = response['QueryExecutionId']

    # Poll until complete
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state  = status['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(1)

    if state != 'SUCCEEDED':
        reason = status['QueryExecution']['Status'].get('StateChangeReason', 'no reason reported')
        raise RuntimeError(f"Query {state}: {reason}")

    # Fetch results as a DataFrame
    # Note: get_query_results returns at most 1000 rows per call;
    # use the GetQueryResults paginator for larger result sets
    results = athena.get_query_results(QueryExecutionId=query_id)
    cols = [c['Label'] for c in results['ResultSet']['ResultSetMetadata']['ColumnInfo']]
    rows = [[field.get('VarCharValue', '') for field in row['Data']]
            for row in results['ResultSet']['Rows'][1:]]  # first row is the header
    return pd.DataFrame(rows, columns=cols)

# Use it
df = run_query(
    sql="SELECT region, SUM(revenue) AS total FROM sales_silver WHERE year=2025 AND month=3 GROUP BY region",
    database="analytics",
    output_bucket="your-results-bucket"
)
print(df.head())

Athena in a Data Lake Pipeline

S3 (Parquet + partitioned) → Glue Catalog (schema) → Athena (SQL queries) → QuickSight / Tableau

🎯 Key Takeaways

  • Athena queries S3 directly — no cluster, no data loading, pay per terabyte scanned
  • Parquet + partitioning reduces query cost by 90%+ compared to CSV without partitions
  • MSCK REPAIR TABLE must be run after adding new partitions to make them visible
  • Iceberg tables add MERGE, DELETE, UPDATE and time travel to S3 data
  • Use Workgroups to set per-query scan limits and prevent accidental expensive queries
  • Athena uses the Glue Data Catalog — tables defined in Glue are immediately queryable