Batch vs Streaming vs Micro-Batch
When each processing model is right, the trade-offs nobody talks about, and how modern systems blend all three.
The Question Is Not Which Is Better — It Is Which Fits the Problem
Streaming is not always better than batch. Batch is not always the safe conservative choice. Micro-batch is not a compromise that gets the best of both worlds — it has its own distinct trade-offs. The choice between these three processing models is a fundamental architectural decision that affects everything downstream: infrastructure cost, operational complexity, latency guarantees, and failure modes.
Most real-world data platforms use all three simultaneously. A mature FreshMart data platform has: a nightly batch job that reprocesses the full day's orders for the finance report (batch), a Kafka consumer that updates the real-time delivery tracking dashboard (streaming), and an hourly Spark job that updates customer segmentation (micro-batch). Understanding when each model is appropriate — not which one to use everywhere — is the skill this module builds.
Batch Processing — The Foundation
Batch processing runs a pipeline on a fixed schedule, processing a bounded set of data in one execution. The pipeline starts, reads all the data in its scope, processes it, writes the results, and exits. The next run starts at the next scheduled interval. Everything between runs is accumulated and processed together — hence "batch."
Batch processing is the default processing model and the correct choice for the majority of data engineering workloads. Its simplicity is a feature, not a limitation. A batch pipeline has a clear start, a clear end, a clear scope, and a clear success criterion. Every debugging session, every rerun, every backfill operates on well-understood bounded data.
How batch processing works internally
BATCH PIPELINE EXECUTION CYCLE:
T=06:00 Pipeline starts (triggered by cron or Airflow)
Run parameters: run_date = 2026-03-17
T=06:01 Extract phase:
Read all orders WHERE date = '2026-03-17'
→ 48,234 rows fetched from PostgreSQL replica
→ Written to S3 Bronze as Parquet
T=06:08 Transform phase (dbt):
dbt run --select models/silver/orders.sql
→ 48,234 rows cleaned and typed
→ 47 rows rejected (written to DLQ)
→ Written to silver.orders partition date=2026-03-17
T=06:14 Aggregate phase (dbt Gold):
dbt run --select models/gold/daily_revenue.sql
→ Computes SUM, COUNT, AVG per store per category
→ Written to gold.daily_revenue
T=06:17 PIPELINE EXITS — run complete
Duration: 17 minutes
Status: SUCCESS
Next run: T+24h at 06:00
WHAT HAPPENED BETWEEN T=06:17 AND T=06:00 NEXT DAY:
→ The pipeline does not exist as a running process
→ No compute resources are consumed
→ Data written between runs accumulates, waiting for next batch
→ A customer who placed an order at 08:00 AM won't see it in
analytics until 06:17 AM the NEXT day (22+ hour latency)
THIS IS FINE when:
The business question is "what was yesterday's revenue?" (answered at 6:17 AM)
The business question is "how did this week's promotions perform?" (daily is enough)
THIS IS NOT FINE when:
The business question is "is there fraud happening RIGHT NOW?"
The business question is "what is the delivery driver's current location?"
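The bounded-run model described above can be sketched as a minimal driver. This is an in-memory illustration, not FreshMart's actual pipeline: the sample rows, the validation rule, and the return shape are all invented for the example.

```python
def run_batch(run_date: str) -> dict:
    """One bounded batch run: extract -> transform -> load, then exit.

    A real pipeline would read from a PostgreSQL replica and write
    Parquet to S3; here lists stand in for both.
    """
    # Extract: everything in scope for this run_date, nothing more.
    rows = [
        {"order_id": 1, "date": run_date, "amount": 25.0},
        {"order_id": 2, "date": run_date, "amount": -1.0},     # invalid row
        {"order_id": 3, "date": "other-day", "amount": 10.0},  # out of scope
    ]
    in_scope = [r for r in rows if r["date"] == run_date]

    # Transform: reject invalid rows (they would go to a DLQ).
    clean = [r for r in in_scope if r["amount"] >= 0]
    rejected = [r for r in in_scope if r["amount"] < 0]

    # Load: aggregate, "write" the result — and the run is over.
    revenue = sum(r["amount"] for r in clean)
    return {"rows": len(clean), "rejected": len(rejected), "revenue": revenue}

# A scheduler (cron or Airflow) calls this once per interval:
print(run_batch("2026-03-17"))  # {'rows': 1, 'rejected': 1, 'revenue': 25.0}
```

Because every run is parameterised by its date, reruns and backfills are just calls with a different `run_date` — the idempotency the section describes.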
Streaming Processing — Continuous, Event-Driven
Streaming processes each event as it arrives, with no concept of a run boundary. The pipeline is always running, continuously consuming events from a source (usually Kafka) and producing outputs with latency measured in milliseconds to seconds. There is no "start of batch" and no "end of batch" — just an infinite sequence of events flowing through the system.
The streaming data model — events, windows, and watermarks
Streaming introduces concepts that do not exist in batch processing. Every data engineer who works with streaming must understand these three concepts precisely before writing a single line of streaming code.
EVENT TIME: When the event actually happened (the timestamp in the event payload)
PROCESSING TIME: When the streaming system processed the event
These two times diverge whenever:
- The network is slow (event took 30 seconds to arrive)
- The device was offline (mobile app buffered events, flushed when reconnected)
- The system was under load (Kafka consumer fell behind)
- The event source has retries (event replayed with original timestamp)
EXAMPLE:
A FreshMart delivery agent marks an order "delivered" at 11:58 PM
on a bad network connection. The event reaches Kafka at 12:03 AM.
Event time: 2026-03-17 23:58:00 (when the tap happened)
Processing time: 2026-03-18 00:03:00 (when Kafka received it)
If your streaming pipeline counts "deliveries on 2026-03-17":
Using event time: counts this delivery correctly for March 17
Using processing time: counts this delivery for March 18 — WRONG
ALWAYS use event time for business metrics.
Only use processing time for system metrics (consumer lag, throughput).
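The divergence is easy to demonstrate: the same event lands in different daily buckets depending on which timestamp you aggregate by. A minimal sketch reproducing the FreshMart delivery example:

```python
from datetime import datetime

# The delayed "delivered" event from the example above:
event = {
    "event_time": datetime(2026, 3, 17, 23, 58),      # when the tap happened
    "processing_time": datetime(2026, 3, 18, 0, 3),   # when Kafka received it
}

def day_bucket(ts: datetime) -> str:
    """Assign a timestamp to a daily aggregation bucket."""
    return ts.date().isoformat()

print(day_bucket(event["event_time"]))       # 2026-03-17 — correct business metric
print(day_bucket(event["processing_time"]))  # 2026-03-18 — counted on the wrong day
```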
WINDOWS:
Streaming aggregations operate over time windows — bounded periods
during which events are collected before computing results.
TUMBLING WINDOWS (non-overlapping, fixed size):
[00:00–01:00] [01:00–02:00] [02:00–03:00]
Each event belongs to exactly one window.
Use for: hourly/daily aggregates, session-independent metrics.
SLIDING WINDOWS (overlapping, fixed size, moves by step):
[00:00–01:00] [00:30–01:30] [01:00–02:00] (30-min step)
Each event may belong to multiple windows.
Use for: moving averages, rolling metrics.
SESSION WINDOWS (variable size, bounded by inactivity gap):
[events...gap > 30min...][events...gap > 30min...][events...]
Window closes when no events arrive for the gap duration.
Use for: user session analysis, visit duration.
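Window assignment for tumbling and sliding windows is pure arithmetic on the event time. A sketch using epoch-second timestamps (session windows are omitted — they require stateful gap tracking, not arithmetic):

```python
def tumbling_window(event_ts: int, size: int) -> tuple:
    """The single window [start, end) a timestamp falls into."""
    start = (event_ts // size) * size
    return (start, start + size)

def sliding_windows(event_ts: int, size: int, step: int) -> list:
    """All overlapping windows of `size`, advancing by `step`, that contain event_ts."""
    first = ((event_ts - size) // step + 1) * step  # earliest window start that still covers the event
    return [(s, s + size)
            for s in range(max(first, 0), event_ts + 1, step)
            if s <= event_ts < s + size]

# An event at 01:15 (4500s) with 1-hour tumbling windows -> exactly one window:
print(tumbling_window(4500, 3600))        # (3600, 7200) i.e. [01:00, 02:00)

# The same event with 1-hour windows sliding every 30 minutes -> two windows:
print(sliding_windows(4500, 3600, 1800))  # [(1800, 5400), (3600, 7200)]
```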
WATERMARKS:
A watermark is the streaming system's current estimate of the
maximum event time it has seen, minus an allowed lateness.
watermark = max_event_time_seen - allowed_lateness
Purpose: tell the system when it is safe to close a window and
produce a result, even if late events might still arrive.
Example:
allowed_lateness = 5 minutes
max event time seen = 23:58:00
watermark = 23:53:00
Window [23:00–24:00] is not closed yet — events up to 5 min
late may still arrive.
Window [22:00–23:00] IS closed — no events older than 23:53
can legitimately arrive.
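The closing rule can be checked mechanically: a window may emit its final result once the watermark has passed its end. A sketch reproducing the numbers above, using minutes-since-midnight for readability:

```python
def watermark(max_event_time_seen: int, allowed_lateness: int) -> int:
    """The system's estimate of how far event time has safely progressed."""
    return max_event_time_seen - allowed_lateness

def window_closed(window_end: int, wm: int) -> bool:
    """A window is finalised once the watermark reaches its end."""
    return wm >= window_end

def minutes(h: int, m: int) -> int:
    return h * 60 + m

# Max event time seen = 23:58, allowed lateness = 5 minutes:
wm = watermark(minutes(23, 58), 5)          # 23:53
print(window_closed(minutes(24, 0), wm))    # [23:00-24:00) still open -> False
print(window_closed(minutes(23, 0), wm))    # [22:00-23:00) finalised -> True
```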
Allowed lateness too small → windows close early → late events dropped → incorrect results
Allowed lateness too large → windows close late → higher latency → more memory used

Streaming architecture — the components
STREAMING PIPELINE ARCHITECTURE:
EVENT SOURCE MESSAGE BROKER STREAM PROCESSOR SINK
─────────────────────────────────────────────────────────────────────────────
Payment service → Kafka topic: → Flink/Spark → Cassandra
(produces events) payments.v1 Streaming (real-time store)
→ Kafka topic:
enriched_payments
(downstream)
→ S3 Parquet
(data lake landing)
KAFKA CONSUMER GROUP MECHANICS:
- Multiple consumer instances in a group share the topic partitions
- Each partition is consumed by exactly one consumer at a time
- Offset tracks position: which events have been processed
- Auto-commit offset vs manual commit after successful processing
CONSUMER OFFSET MANAGEMENT:
# Manual offset commit (recommended for correctness) — kafka-python style sketch:
records = consumer.poll(timeout_ms=1000)   # {TopicPartition: [messages]}
for partition_messages in records.values():
    for message in partition_messages:
        process(message.value)
        write_to_sink(message.value)       # write BEFORE committing the offset
consumer.commit()                          # commit AFTER all writes succeed
# If a write fails: do not commit → messages reprocessed on next poll → at-least-once
# Auto-commit (default, simpler, less safe):
# Offset committed on a timer regardless of whether processing succeeded
# Risk: offset committed before the write completes → message lost on crash
consumer = KafkaConsumer('payments.v1', enable_auto_commit=True)  # do NOT use for financial data
SPARK STRUCTURED STREAMING (micro-batch under the hood, streaming API):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window, sum as spark_sum
from pyspark.sql.types import StructType, StringType, DecimalType, TimestampType
spark = SparkSession.builder.appName('payment_stream').getOrCreate()
payment_schema = (
    StructType()
    .add('payment_id', StringType())
    .add('amount', DecimalType(10, 2))
    .add('store_id', StringType())
    .add('event_time', TimestampType())
)
# Read from Kafka:
raw_stream = (
    spark.readStream
    .format('kafka')
    .option('kafka.bootstrap.servers', 'kafka:9092')
    .option('subscribe', 'payments.v1')
    .option('startingOffsets', 'latest')
    .load()
)
# Parse JSON payload:
payments = (
    raw_stream
    .select(from_json(col('value').cast('string'), payment_schema).alias('data'))
    .select('data.*')
    .withWatermark('event_time', '5 minutes')
)
# Aggregate: revenue per store per 1-hour tumbling window:
hourly_revenue = (
    payments
    .groupBy(
        window('event_time', '1 hour'),
        'store_id',
    )
    .agg(spark_sum('amount').alias('hourly_revenue'))
)
# Write to sink:
# Write to sink:
query = (
    hourly_revenue.writeStream
    .outputMode('update')
    .format('delta')
    .option('checkpointLocation', 's3://freshmart-lake/checkpoints/hourly_revenue')
    .trigger(processingTime='30 seconds')
    .start('s3://freshmart-lake/silver/hourly_revenue')
)
Micro-Batch — Small Batches on a Short Interval
Micro-batch is batch processing applied to very short time intervals — typically 30 seconds to 15 minutes. Instead of processing a full day's data once per day, a micro-batch pipeline processes the last N minutes of data every N minutes. The result is much lower latency than daily batch while retaining most of the simplicity of batch processing.
This is how Spark Structured Streaming actually works internally: it is not true record-by-record streaming. It collects micro-batches of records from Kafka, processes each batch as a bounded Spark job, and outputs results at the configured trigger interval. Apache Flink and Kafka Streams are the mainstream tools that do true record-by-record processing.
Micro-batch vs true streaming — the important distinction
MICRO-BATCH (Spark Structured Streaming, default):
t=0s: Collect all Kafka messages arrived in last 30 seconds
t=0.5s: Process as one Spark batch job (bounded)
t=2.3s: Write results to sink
t=2.3s: Commit Kafka offsets
t=30s: Collect next batch... repeat
Latency: ~30 seconds (trigger interval + processing time)
Throughput: Very high (Spark is optimised for large batches)
State: Managed per-batch via checkpoint
Strengths: High throughput, familiar Spark APIs, good recovery
Weakness: Minimum latency = trigger interval (cannot go below ~1 second practically)
TRUE STREAMING (Apache Flink):
Event arrives → Immediately processed → Output emitted
No waiting for batch boundary. Each record is processed as it arrives.
Latency: Milliseconds (end-to-end 10–200ms typical)
Throughput: Lower per-record efficiency, higher parallelism
State: Maintained continuously in distributed state store (RocksDB)
Strengths: True low-latency, native event time semantics, complex CEP
Weakness: More complex to operate, more expensive, harder to debug
CHOOSING:
Need < 1 second latency? → True streaming (Flink)
1 second – 15 minutes latency? → Micro-batch (Spark Structured Streaming)
> 15 minutes latency? → Batch (standard Spark or dbt)
Most "real-time" dashboards? → Micro-batch with 5-minute trigger

Micro-batch implementation patterns
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DecimalType
spark = SparkSession.builder.getOrCreate()
stream = (
    spark.readStream
    .format('kafka')
    .option('kafka.bootstrap.servers', 'kafka:9092')
    .option('subscribe', 'orders.v1')
    .load()
)
# TRIGGER OPTIONS — control micro-batch interval:
# Fixed interval micro-batch (most common):
query = (
    stream.writeStream
    .trigger(processingTime='5 minutes')   # process every 5 minutes
    .format('delta')
    .option('checkpointLocation', 's3://freshmart-lake/checkpoints/orders_stream')
    .start('s3://freshmart-lake/bronze/orders_stream')
)
# Once trigger — process all available data right now, then stop:
# Useful for backfill or scheduled runs that want streaming semantics:
query = (
    stream.writeStream
    .trigger(once=True)
    .format('delta')
    .start('s3://freshmart-lake/bronze/orders_stream')
)
# Available-now trigger (Spark 3.3+) — process all available data in batches:
query = (
    stream.writeStream
    .trigger(availableNow=True)
    .format('delta')
    .start('s3://freshmart-lake/bronze/orders_stream')
)
# Continuous processing (experimental — approaches true streaming):
query = (
    stream.writeStream
    .trigger(continuous='1 second')   # checkpoint every 1 second
    .format('console')
    .start()
)
# MICRO-BATCH WITH PLAIN PYTHON (no Spark — for simpler cases):
import time
from datetime import datetime, timedelta, timezone
def run_micro_batch(interval_seconds: int = 300) -> None:
    """Run a micro-batch loop that processes N minutes of data repeatedly."""
    # extract_from_source / transform / upsert_to_destination are the
    # pipeline-specific steps, defined elsewhere.
    batch_start = datetime.now(timezone.utc) - timedelta(seconds=interval_seconds)
    while True:
        batch_end = datetime.now(timezone.utc)
        try:
            rows = extract_from_source(batch_start, batch_end)
            transformed = transform(rows)
            upsert_to_destination(transformed)  # upsert is critical here
            print(f'Batch [{batch_start.isoformat()} - {batch_end.isoformat()}]: {len(rows)} rows')
            batch_start = batch_end  # advance the window only on success
        except Exception as e:
            print(f'Batch failed: {e}')
            # batch_start NOT advanced — the same window is retried next loop
        time.sleep(interval_seconds)  # wait before next batch

Batch vs Micro-Batch vs Streaming — Every Dimension
| Dimension | Batch | Micro-Batch | Streaming |
|---|---|---|---|
| Latency | Minutes to hours | Seconds to minutes | Milliseconds to seconds |
| Trigger | Schedule (cron/Airflow) | Time interval (every N minutes) | Each event arrival |
| Data model | Bounded dataset | Bounded micro-dataset | Unbounded event sequence |
| Execution | Run, complete, exit | Run, complete, wait, repeat | Always running |
| State management | Stateless (checkpoint only) | Micro-state per trigger | Full stateful processing |
| Late data handling | Natural — rerun the batch | Overlap windows + upsert | Watermarks + allowed lateness |
| Reprocessing | Trivial — rerun with date param | Replay from Kafka offset | Replay from Kafka offset |
| Debugging | Simple — fixed snapshot in time | Moderate | Complex — state, order, watermarks |
| Infrastructure | Ephemeral (start/stop per run) | Always-on (lower cost) | Always-on (higher cost) |
| Correctness | Easiest — all data available | Good with upsert + overlap | Hardest — requires watermarks |
| Best tools | Spark, dbt, Python, SQL | Spark Structured Streaming | Apache Flink, Kafka Streams |
| Best for | Finance reports, daily aggregates, warehouse loads | Near-realtime dashboards, hourly metrics, CDC | Fraud detection, live tracking, operational alerts |
Late-Arriving Data — The Correctness Challenge Unique to Streaming
Late-arriving data is the defining correctness challenge of stream processing. In batch processing, you simply run the pipeline after the data settles and everything is available. In streaming, events arrive out of order and delayed — and the system must decide when it is safe to close a window and produce a result, trading latency for completeness.
# ── BATCH: the trivial solution ──────────────────────────────────────────────
# Run with a delay to let late data arrive.
# The March 17 batch runs at 6 AM March 18 — 6+ hours after midnight.
# Any order timestamped March 17 but arriving late has 6 hours to arrive.
# Simple. Correct. No special logic needed.
# For very late data (hours or days late), run a correction batch:
# 0 10 * * * python3 pipeline.py --date yesterday --mode correction
# This reprocesses yesterday with all data that has arrived since the daily run.
# Upsert semantics make this safe.
# ── MICRO-BATCH: overlap windows + upsert ────────────────────────────────────
# Use overlapping time windows to catch most late arrivals.
# Upsert on event_id ensures duplicates from overlap are handled.
# If micro-batch interval is 5 minutes, query with 10-minute lookback:
def extract_with_overlap(batch_end, overlap_minutes=10):
batch_start = batch_end - timedelta(minutes=5 + overlap_minutes)
return db.query(
"SELECT * FROM events WHERE event_time BETWEEN %s AND %s",
(batch_start, batch_end),
)
# This reads events from 15 minutes ago to now.
# Events that arrived late (up to 10 minutes late) are included.
# Upsert at destination handles the re-processing of already-seen events.
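The upsert that makes overlap safe is key-based deduplication. A minimal in-memory stand-in (a real destination would use Delta `MERGE` or PostgreSQL `INSERT ... ON CONFLICT`; the dict and sample rows are illustrative):

```python
destination = {}  # event_id -> row, standing in for the destination table

def upsert(rows: list) -> None:
    """Insert new events, overwrite already-seen ones — idempotent under overlap."""
    for row in rows:
        destination[row["event_id"]] = row  # last write per key wins

# Two overlapping micro-batches both contain event "b":
upsert([{"event_id": "a", "amount": 10}, {"event_id": "b", "amount": 20}])
upsert([{"event_id": "b", "amount": 20}, {"event_id": "c", "amount": 30}])

print(len(destination))                                # 3 — no duplicate for "b"
print(sum(r["amount"] for r in destination.values()))  # 60 — total is still correct
```

Because re-processing an already-seen event is a no-op, the 10-minute lookback can be widened freely without corrupting the destination.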
# ── STREAMING (Spark): watermarks + allowed lateness ────────────────────────
from pyspark.sql.functions import window, sum as spark_sum
# Define allowed lateness of 10 minutes:
payments_with_watermark = payments.withWatermark('event_time', '10 minutes')
# ↑ Any event more than 10 minutes behind the watermark is considered late
# Aggregate with tumbling window:
hourly_revenue = (
    payments_with_watermark
    .groupBy(
        window('event_time', '1 hour'),
        'store_id',
    )
    .agg(spark_sum('amount').alias('revenue'))
)
# Output mode matters for late data:
# 'append': Output only when window is finalised (after watermark passes)
# → Lowest memory, highest latency, correct
# 'update': Output whenever window data changes (including late updates)
# → Lower latency, more updates to downstream
# 'complete': Output all windows on every trigger
# → Only for small datasets (memory grows unboundedly)
query = (
    hourly_revenue.writeStream
    .outputMode('update')   # emit updates as late data arrives
    .format('delta')
    .option('checkpointLocation', 's3://checkpoints/hourly_revenue')
    .trigger(processingTime='1 minute')
    .start('s3://freshmart-lake/silver/hourly_revenue_stream')
)
# With update mode and upsert at destination (Delta MERGE):
# → Window results are updated as late events arrive
# → Final result is correct after watermark passes
# → Downstream consumers see intermediate updates (must handle them)
# ── THE LATE DATA DECISION TREE ───────────────────────────────────────────────
# How late does your data arrive? → How much lateness should you allow?
#
# 99th percentile latency < 5 min: allowed_lateness = 10 min
# 99th percentile latency < 30 min: allowed_lateness = 60 min
# Data can be hours late (mobile app offline): use batch with correction run
#
# Rule: if you cannot bound your data lateness to < 1 hour,
# streaming with watermarks becomes very expensive.
# A 1-hour watermark on 1-minute windows means holding 60 windows open
# in memory simultaneously. Use batch or micro-batch + correction instead.

How to Choose the Right Processing Model
The wrong choice is almost always defaulting to streaming because it sounds more advanced or modern. Streaming adds significant operational complexity, infrastructure cost, and correctness challenges. Choose it only when the latency requirement genuinely requires it.
QUESTION 1: What is the maximum acceptable data latency for this use case?
< 1 second: True streaming only (Flink / Kafka Streams)
1s – 5 min: Micro-batch (Spark Structured Streaming, 30s trigger)
5 – 60 min: Micro-batch (5–15 minute trigger) or fast batch
> 1 hour: Batch (daily, hourly, or whatever interval fits)
COMMON MISTAKE: picking streaming because "real-time" sounds better.
Ask: what decision or action requires this latency? If the answer
is "a dashboard refresh" — does the user genuinely need sub-second
updates, or would 5-minute updates serve equally well?
QUESTION 2: How complex is the transformation logic?
Simple filtering/typing, no joins: All three work fine
Joins to slowly-changing dimensions: Batch or micro-batch (easier state)
Aggregations over large time windows: Batch (all data available at once)
Pattern detection across event sequence: Streaming (Flink CEP)
ML model inference per event: Streaming (low-latency requirement)
RULE: if the transformation requires data from multiple time periods
or large lookups, streaming state management becomes complex and
expensive. Batch makes this trivial.
QUESTION 3: How late can source events arrive?
< 5 minutes late: Micro-batch with 10-min lookback overlap
5–60 minutes late: Streaming with 60-min watermark OR micro-batch + correction
Hours late (mobile offline data, delayed batch feeds): Batch only
QUESTION 4: How complex can the operations model be?
Small team, no streaming expertise: Batch (always)
Team familiar with Spark Streaming: Micro-batch
Dedicated streaming engineers: True streaming if latency requires it
RULE: streaming pipelines require more engineering expertise to build,
more infrastructure to run, and more time to debug. Only introduce this
complexity when the latency requirement justifies it.
QUESTION 5: What is the data volume?
< 1 GB/day: Batch on a single machine (Pandas, not even Spark needed)
1 GB – 1 TB/day: Batch Spark or micro-batch Spark
> 1 TB/day: Micro-batch or streaming depending on latency needs
> 10 TB/day: Almost certainly micro-batch or streaming
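The latency thresholds from Question 1 can be encoded as a small routing function. This is a sketch of this module's heuristic only — the remaining questions (logic complexity, lateness, team expertise, volume) can veto streaming even when latency alone would allow it:

```python
def choose_model(max_latency_seconds: float) -> str:
    """Map the maximum acceptable data latency to a processing model."""
    if max_latency_seconds < 1:
        return "true streaming (Flink / Kafka Streams)"
    if max_latency_seconds <= 15 * 60:
        return "micro-batch (Spark Structured Streaming)"
    return "batch (Spark, dbt, SQL)"

print(choose_model(0.5))    # fraud detection  -> true streaming (Flink / Kafka Streams)
print(choose_model(300))    # ops dashboard    -> micro-batch (Spark Structured Streaming)
print(choose_model(86400))  # finance report   -> batch (Spark, dbt, SQL)
```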
PRACTICAL ROUTING TABLE:
Finance report (daily revenue, costs): BATCH
Operations dashboard (last 15 min): MICRO-BATCH (5 min)
Real-time fraud detection: STREAMING (Flink)
Customer segmentation (weekly): BATCH
Live delivery tracking: STREAMING
Hourly data quality check: MICRO-BATCH (10 min)
Monthly cohort retention analysis: BATCH
Payment gateway health monitoring: STREAMING

The Kappa architecture — why Lambda is falling out of favour
The Lambda architecture (separate batch and streaming layers, merged at serving) was the dominant pattern in 2015–2020. It has fallen out of favour because maintaining two codebases for the same logic — one in batch SQL and one in streaming Java/Scala — doubles the engineering burden and introduces subtle correctness differences between the two paths.
LAMBDA ARCHITECTURE (2012–2020, now declining):
Source → Batch Layer (Spark, nightly) → Batch views (accurate, slow)
→ Speed Layer (Storm/Flink) → Real-time views (fast, approximate)
↓
Serving Layer (merge both)
Problems:
- Two codebases doing the same logic (batch SQL + streaming Java)
- Subtle differences between batch and stream results (bugs)
- Operational complexity of maintaining two stacks
- Speed layer results replaced by batch results once batch catches up
KAPPA ARCHITECTURE (2014–present, now dominant):
Source → Kafka (persistent event log, replayed for reprocessing)
→ Single streaming pipeline (Flink or Spark Streaming)
→ Serving layer (no separate batch layer)
Reprocessing: replay Kafka from the beginning with a new consumer group
Historical: Kafka retention configured for months/years of data
Benefits:
- Single codebase
- One source of truth
- Reprocessing is natural (replay Kafka)
- Simpler to operate
Limitation: requires Kafka to retain data long enough for full reprocessing
(expensive at high data volumes — terabytes stored in Kafka)
MODERN HYBRID (2022–present, most practical):
Source → Kafka (days of retention for streaming)
→ S3/ADLS (years of retention as Parquet, Delta Lake)
Streaming path: Kafka → Flink/Spark Streaming → real-time sink (low latency)
Batch path: S3 → Spark batch → warehouse (historical accuracy, bulk)
Both write to the same Delta Lake tables (upserts reconcile any differences)

Designing FreshMart's Three-Layer Processing Architecture
FreshMart's CTO asks you to design the data processing architecture for three specific business requirements. Here is how a senior data engineer applies the decision framework to each:
Requirement 1 — Finance department needs a daily P&L report by 7 AM. The report covers all transactions from the previous day, reconciled against bank settlement data. It needs to be exact — finance cannot act on approximate data.
Decision: Batch, daily at 4 AM. Exact financial data requires all transactions to have settled before processing. A 4 AM run gives 4 hours for any late-arriving transactions to appear in the source before the report is needed at 7 AM. Streaming would add complexity without helping — finance does not act on real-time P&L, only on the final daily figure.
Requirement 2 — Operations team needs a dashboard showing store performance for the last 30 minutes, refreshed every 5 minutes. Store managers use this to respond to issues mid-day (slow service, high order cancellation rate).
Decision: Micro-batch, 5-minute trigger. 5-minute latency is perfectly adequate for a manager responding to operational issues. A Spark Structured Streaming job reads from Kafka (where order events are published as they occur), aggregates the last 30 minutes of data every 5 minutes, and writes to a Delta table that the dashboard queries. This is far simpler than true streaming and adequate for the use case.
Requirement 3 — Risk team needs real-time payment fraud detection — block suspicious transactions before they complete. A transaction that requires investigation must be flagged within 2 seconds of the payment event.
Decision: True streaming, Apache Flink. The 2-second latency requirement eliminates both batch and micro-batch. Flink is chosen over Spark Streaming because Flink's native event time processing and sub-second latency per record meet the requirement. The Flink job consumes from Kafka, computes a velocity score (transactions per card in last 60 seconds), and publishes a fraud signal back to Kafka where the payment service reads it before completing the transaction.
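The velocity score in this design can be sketched as a per-card sliding count. This is an illustrative stand-in for Flink's keyed state, not the actual job: the threshold, function names, and in-memory deques are all assumptions.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 5  # hypothetical: more than 5 transactions per card per minute is suspicious

recent = defaultdict(deque)  # card_id -> event times inside the 60s window

def velocity_score(card_id: str, event_ts: float) -> int:
    """Transactions for this card in the last 60 seconds, including this one."""
    q = recent[card_id]
    q.append(event_ts)
    while q and q[0] <= event_ts - WINDOW_SECONDS:
        q.popleft()  # evict events that fell out of the window
    return len(q)

def is_suspicious(card_id: str, event_ts: float) -> bool:
    return velocity_score(card_id, event_ts) > THRESHOLD

# Six rapid transactions on one card within a minute trip the threshold:
flags = [is_suspicious("card-42", t) for t in range(6)]
print(flags)  # [False, False, False, False, False, True]
```

In the real pipeline this per-card state lives in Flink's RocksDB state backend and the fraud signal is published back to Kafka for the payment service to act on.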
Three requirements, three different processing models. All three run in the same data platform, sharing the same Kafka cluster and the same Delta Lake. This is the architecture of a mature, fit-for-purpose data platform — not one model for everything, but the right model for each need.
🎯 Key Takeaways
- ✓Batch processing runs on a schedule, processes bounded data, exits, and waits for the next scheduled interval. It is simple, correct, cheap, and the right default for the majority of data engineering workloads. Choose batch unless there is a specific latency requirement that batch cannot meet.
- ✓Streaming processing handles each event continuously as it arrives with no run boundary. It is always running, has millisecond-to-second latency, but is significantly more complex and expensive to operate. Choose streaming only when the latency requirement is genuinely below what micro-batch can provide.
- ✓Micro-batch is batch processing at short intervals (30 seconds to 15 minutes). Spark Structured Streaming is micro-batch under the hood — it collects records per trigger interval and processes them as bounded Spark jobs. Most "real-time" dashboard use cases are best served by micro-batch.
- ✓Event time is when the event happened (the timestamp in the event payload). Processing time is when the pipeline processed it. Always use event time for business metrics — processing time produces wrong results when events arrive late or out of order.
- ✓Watermarks tell a streaming system when to close a time window despite possible late arrivals. watermark = max_event_time_seen - allowed_lateness. Smaller allowed lateness = lower latency but late events are dropped. Larger allowed lateness = more complete results but higher latency.
- ✓Late data handling: in batch, run the pipeline with a delay and optionally run correction batches (upserts handle re-processing cleanly). In micro-batch, use overlapping windows plus upserts. In streaming, use watermarks and allowed lateness. If data can be hours late, do not use streaming — use batch with correction.
- ✓The decision framework: latency < 1s → true streaming (Flink); 1s–15min → micro-batch (Spark Structured Streaming); > 15min → batch. Always ask what business decision requires this latency before choosing streaming.
- ✓Spark Structured Streaming is micro-batch, not true streaming. Apache Flink is true record-by-record streaming. For sub-second latency requirements, only Flink (or Kafka Streams) qualifies. For 1-minute or slower latency, Spark Structured Streaming is simpler and adequate.
- ✓Lambda architecture (separate batch and streaming layers merged at serving) is declining because it requires two codebases for the same logic. The Kappa architecture (single streaming pipeline, replay Kafka for reprocessing) and the modern hybrid (streaming + batch writing to the same Delta Lake tables) are the current standards.
- ✓Mature data platforms use all three models simultaneously: batch for finance reports, micro-batch for operations dashboards, streaming for fraud detection. The skill is matching the model to the latency requirement — not picking one model for everything.