Batch vs. Streaming — The Decision Framework Every Data Engineer Needs
Most data engineering architecture decisions come down to one question: batch or streaming? Get this wrong and you either build an overly complex streaming system when a simple batch job would have worked fine, or you build a batch pipeline that cannot meet the business's latency requirement.
The core difference
Batch processing collects data over a period of time and processes it all at once on a schedule. Run at 2am, process all of yesterday's data, write results, done.
Stream processing processes data continuously as it arrives, event by event, with latency measured in milliseconds to seconds.
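The contrast can be sketched in a few lines of plain Python. This is an illustration, not any framework's API: `process_batch` and `process_event` are made-up names, and the "scheduler" and "stream" are simulated with a list.

```python
# Batch: accumulate events, then process everything at once on a schedule.
def process_batch(events):
    # One pass over all data collected since the last run.
    return sum(e["amount"] for e in events)

# Streaming: handle each event the moment it arrives.
def process_event(state, event):
    # State is updated incrementally, event by event.
    state["total"] += event["amount"]
    return state

yesterday = [{"amount": 10}, {"amount": 5}, {"amount": 7}]

# Batch run (e.g. triggered by a 2am scheduler):
batch_total = process_batch(yesterday)  # 22

# Streaming run (events trickle in one at a time):
state = {"total": 0}
for event in yesterday:
    state = process_event(state, event)  # running total: 10, 15, 22
```

Both arrive at the same answer; the difference is *when* — once per schedule versus continuously per event.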
The key insight: streaming is not always better. It is more complex, more expensive, and harder to debug. You should only use streaming when the business genuinely requires low latency.
The decision framework — ask these 4 questions
1. What is the required latency?
- Hours or days: batch is fine
- Minutes: micro-batch (e.g. Spark Structured Streaming with a short trigger interval)
- Seconds or milliseconds: true streaming required
2. What is the data volume?
- High volume, low frequency: batch wins
- Low volume, high frequency: streaming is manageable
- High volume, high frequency: streaming is expensive — challenge the business requirement
3. How complex is the transformation?
- Simple aggregations: both work
- Joins across multiple streams: streaming becomes very complex — consider whether batch solves the problem instead
4. What happens if you are 1 hour late?
- Business critical: streaming
- Reporting and analytics: batch is almost always fine
Real examples of each
Clear batch use cases:
- Daily sales reporting (nobody needs this at 3am with 10ms latency)
- Monthly customer churn analysis
- Weekly ETL from operational databases to data warehouse
- End-of-day financial reconciliation
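A daily job like the sales report above often reduces to a single scheduled aggregation query. A minimal, self-contained sketch using an in-memory SQLite database — the `sales` table, its columns, and the data are invented for illustration:

```python
import sqlite3
from datetime import date

def daily_sales_report(conn, report_date):
    # Aggregate one full day of data in a single pass — classic batch.
    return conn.execute(
        "SELECT product, SUM(amount) FROM sales "
        "WHERE sale_date = ? GROUP BY product ORDER BY product",
        (report_date.isoformat(),),
    ).fetchall()

# A scheduler (cron, Airflow, etc.) would call this once per night.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL, sale_date TEXT)")
day = date(2024, 1, 1)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("widget", 10.0, day.isoformat()),
     ("widget", 5.0, day.isoformat()),
     ("gadget", 7.5, day.isoformat())],
)
report = daily_sales_report(conn, day)
# [('gadget', 7.5), ('widget', 15.0)]
```

No stream, no broker, no state management — a query and a schedule.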
Clear streaming use cases:
- Fraud detection on credit card transactions (must decide in milliseconds)
- Real-time inventory tracking during flash sales
- Live sports scores and statistics
- Industrial sensor monitoring for equipment failure
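The fraud-detection case shows why per-event processing is unavoidable there: the decision has to happen inline, before the transaction completes, so there is no batch to wait for. A toy sliding-window sketch — the threshold, window size, and field names are all invented:

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_TXNS_PER_WINDOW = 3  # invented threshold for illustration

# card_id -> timestamps of that card's recent transactions
recent = defaultdict(deque)

def check_transaction(card_id, timestamp):
    """Decide approve/flag inline, per event — no waiting for a batch run."""
    window = recent[card_id]
    # Evict timestamps that fell out of the sliding window.
    while window and timestamp - window[0] > WINDOW_SECONDS:
        window.popleft()
    window.append(timestamp)
    return "flag" if len(window) > MAX_TXNS_PER_WINDOW else "approve"

# Four transactions on the same card within a minute: the fourth is flagged.
t0 = 1_000_000.0
decisions = [check_transaction("card-1", t0 + i) for i in range(4)]
# ['approve', 'approve', 'approve', 'flag']
```

A production system would run this logic in a stream processor with durable state, but the shape is the same: per-event input, incremental state, immediate output.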
What most companies actually use
The honest reality: about 80% of data engineering work in most companies is batch processing. Streaming gets a lot of attention at conferences, but the majority of pipelines running in production at real companies — especially mid-size companies — are scheduled batch jobs.
Learn batch deeply first. Understand streaming conceptually. Build streaming knowledge once you have your first job and face a real streaming requirement.
For interviews: be able to explain the tradeoffs clearly. Most interviewers ask about streaming to test your judgment — they want to know whether you would reach for streaming unnecessarily, not whether you can stand up Kafka.