One of the fundamental distinctions in data processing is *when* data is processed. This distinction has important implications for the architecture and design both of systems that process data and of the applications that use and depend on those systems.
Most simply stated, this fundamental distinction is about whether each piece of new data is processed when it arrives, or instead with a group of arriving data. That distinction divides processing into two categories: batch processing and stream processing.
In batch processing, newly arriving data elements are collected into a group. The whole group is then processed at a future time (as a batch, hence the term “batch processing”). Exactly when each group is processed can be determined in a number of ways. For example, it can be based on a scheduled time interval (e.g., every five minutes, process whatever new data has been collected) or on a triggered condition (e.g., process the group as soon as it contains five data elements or more than 1 MB of data).
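The trigger conditions above can be sketched in a few lines. This is a minimal illustration, not any particular framework's API; the class and parameter names (`Batcher`, `max_count`, `max_bytes`, `max_age_secs`) are made up for the example, and a real system would also use a timer so the age trigger can fire even when no new record arrives.

```python
import time

class Batcher:
    """Collects incoming records and flushes them as one batch when a
    trigger fires: a count threshold, a size threshold, or a time interval.
    Illustrative sketch only; names are not from any real framework."""

    def __init__(self, process_batch, max_count=5,
                 max_bytes=1_000_000, max_age_secs=300.0):
        self.process_batch = process_batch   # callback invoked with the whole batch
        self.max_count = max_count
        self.max_bytes = max_bytes
        self.max_age_secs = max_age_secs
        self.batch = []
        self.batch_bytes = 0
        self.batch_started = time.monotonic()

    def add(self, record: bytes):
        self.batch.append(record)
        self.batch_bytes += len(record)
        # Triggers are only checked when a record arrives; a production
        # system would also run the age check on a background timer.
        if (len(self.batch) >= self.max_count
                or self.batch_bytes >= self.max_bytes
                or time.monotonic() - self.batch_started >= self.max_age_secs):
            self.flush()

    def flush(self):
        if self.batch:
            self.process_batch(self.batch)  # process the whole group at once
        self.batch = []
        self.batch_bytes = 0
        self.batch_started = time.monotonic()
```

Note that the individual records do no work when they arrive; all processing is deferred until `flush` hands the accumulated group to the callback.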
By way of analogy, batch processing is like your friend (you certainly know someone like this) who takes a load of laundry out of the dryer and simply tosses everything into a drawer, only sorting and organizing it once it becomes too hard to find something. This person avoids doing the work of sorting each time they do laundry, but with the tradeoff that they spend a lot of time searching through the drawer whenever they need something and ultimately need to spend a significant amount of time separating clothes, matching socks, etc. when it becomes too difficult to find things.
Historically, the vast majority of data processing technologies have been designed for batch processing. Traditional data warehouses and Hadoop are two common examples of systems focused on batch processing.
The term “micro-batch” is frequently used to describe scenarios where batches are small and/or processed at small intervals. Even though processing may happen as often as once every few minutes, data is still processed a batch at a time. Spark Streaming is an example of a system that supports micro-batch processing.
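A micro-batch workflow can be sketched as a loop that drains whatever has accumulated at each short interval. This is an illustrative sketch of the pattern, not the Spark Streaming API (which expresses the same idea declaratively); the function name and parameters are invented for the example.

```python
import queue
import time

def microbatch_loop(source: "queue.Queue", process_batch,
                    interval_secs=1.0, max_batches=None):
    """Every `interval_secs`, drain everything that has accumulated in
    `source` and process it as one small batch. `max_batches` bounds the
    loop so the example terminates; a real job would run indefinitely."""
    done = 0
    while max_batches is None or done < max_batches:
        time.sleep(interval_secs)
        batch = []
        while True:
            try:
                batch.append(source.get_nowait())
            except queue.Empty:
                break
        process_batch(batch)  # still a batch at a time, just a small one
        done += 1
```

Even at a one-second interval, each element waits for the next tick; the data is never handled the instant it arrives, which is the defining difference from stream processing.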
In stream processing, each new piece of data is processed when it arrives. Unlike batch processing, there is no waiting until the next batch processing interval, and data is processed piece by piece rather than a batch at a time.
Although each new piece of data is processed individually, many stream processing systems also support “window” operations that let processing reference data arriving within a specified interval before and/or after the current element.
Carrying forward our analogy, a stream processing approach to organizing laundry would sort, match, and organize laundry as it is taken out of the dryer. In this approach, a bit more work is done initially for each load of laundry, but there is no longer a need to go back and reorganize the entire drawer at a later time because it is already organized, and time is no longer wasted searching through the drawer for buried pieces of clothing.
There are a growing number of systems designed for stream processing, including Apache Storm and Heron. These systems are frequently deployed to support real-time processing of events.
Implications of the differences between batch and stream processing
Although it may seem that the difference between stream processing and batch processing, especially micro-batch processing, is just a small matter of timing, the two approaches have fundamentally different implications for both the architecture of data processing systems and the applications using them.
Systems for stream processing are designed to respond to data as it arrives. That requires them to implement an event-driven architecture, an architecture in which the internal workflow of the system is designed to continuously monitor for new data and dispatch processing as soon as that data is received. On the other hand, the internal workflow in a batch processing system only checks for new data periodically, and only processes that data when the next batch window occurs.
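The event-driven workflow described above can be sketched as a worker that blocks until data arrives and dispatches immediately, in contrast to a loop that wakes up only periodically. This is a simplified single-threaded illustration; the function name and the `None` stop sentinel are conventions chosen for the example, not part of any particular system.

```python
import queue
import threading

def stream_worker(events: "queue.Queue", handle):
    """Event-driven sketch: block until an element arrives, then dispatch
    it immediately. A batch system's internal workflow would instead
    sleep until the next window and then process everything at once."""
    while True:
        item = events.get()   # blocks; wakes the instant data is available
        if item is None:      # sentinel used here to end the example
            break
        handle(item)          # processed as soon as it is received
```

The key architectural property is that the worker is idle between events and reacts the moment one is enqueued, rather than checking for new data on a schedule.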
The difference between stream and batch processing is also significant for applications. Applications built for batch processing by definition process data with a delay. In a data pipeline with multiple steps, those delays accumulate. In addition, the lag between the arrival of new data and the processing of that data will vary depending on the time until the next batch processing window, from no time at all in some cases to the full time between batch windows for data that arrives just after a batch begins processing. As a result, batch processing applications (and their users) cannot rely on consistent response times and must adjust to that inconsistency and to the greater latency.
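The accumulation of lag across pipeline steps can be made concrete with a back-of-the-envelope calculation. The function below is a deliberately simple model invented for illustration: it ignores the processing time of each batch and assumes every stage uses the same interval and that a record can arrive just after each stage's batch starts.

```python
def worst_case_lag(batch_interval_secs: float, pipeline_stages: int) -> float:
    """Worst-case wait before a record is fully processed: a record that
    arrives just after each stage's batch window opens waits nearly the
    full interval at every stage. Simplified model for illustration."""
    return batch_interval_secs * pipeline_stages
```

For example, with five-minute batches and a three-stage pipeline, a record can wait up to fifteen minutes end to end, while another record that arrives just before each window closes flows through with almost no delay, which is exactly the response-time inconsistency described above.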
Batch processing is generally appropriate for use cases where having the most up-to-date data is not important and where tolerance for slower response time is higher. For example, offline analysis of historical data to compute results or identify correlations is a common batch processing use case.
Stream processing, on the other hand, is necessary for use cases that require live interaction and real-time responsiveness. Financial transaction processing, real-time fraud detection, and real-time pricing are examples that best fit stream processing.