Big Data Processing, Streaming vs Batching
Davi Abdallah
Tech Manager, AI & Data Lakehouse Principal Architect, and Distributed Parallel Data Processing Expert working closely with Data Science, Cloud Engineering, and DevOps teams.
Batch data processing is an efficient way to process high volumes of data as a group of transactions collected over a period of time. Data is collected, processed, and the batch results are then output (Apache Spark, an open-source distributed general-purpose cluster-computing framework, is a batch-processing-driven tool). Batch processing requires separate code for input, processing, and output. By analogy, payroll and billing are similar to batch data processing because they occur in a recurring cycle with a time-limited scope. In contrast, streaming data processing requires continual input, processing, and output of data; the data must be processed within a small time window (in near real time). Radar systems that constantly update location data to feed a dashboard tracking several planes, and bank ATMs, are examples: the data must stream through these systems constantly.
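The three separate stages a batch job requires can be sketched in plain Python. This is a minimal illustration with hypothetical in-memory transaction data, not a production pipeline: input, process, and output are written as distinct functions, mirroring the separate code each stage needs.

```python
def read_batch():
    # Input stage: a real pipeline would read a period's worth of records
    # from files or a database; here we use sample transactions (assumed data).
    return [
        {"account": "A", "amount": 120.0},
        {"account": "B", "amount": 75.5},
        {"account": "A", "amount": 30.0},
    ]

def process_batch(records):
    # Process stage: aggregate the entire collected batch at once.
    totals = {}
    for rec in records:
        totals[rec["account"]] = totals.get(rec["account"], 0.0) + rec["amount"]
    return totals

def write_results(totals):
    # Output stage: emit all batch results in one go.
    for account, total in sorted(totals.items()):
        print(f"{account}: {total:.2f}")

write_results(process_batch(read_batch()))
```

The key property is that nothing is emitted until the whole batch has been collected and processed, which is exactly what makes this pattern unsuitable for near-real-time use cases.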
Organizations generally use batch processing for most data pipelines, but streaming can be used to gain near-real-time insights. This allows much faster reaction times, although it comes with added computing costs. Event processing and operational intelligence use streaming data processing to gain insight into operations by running query analysis against live feeds and event data. "Operational intelligence" is about creating near-real-time analytics and providing visibility across diverse data sources. The goal is to obtain near-real-time insights through continuous analytics so that the organization can take immediate action when important events occur. Contrast this with business intelligence, which entails descriptive or historical analysis of operational data.
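Continuous analytics over a live feed can be sketched as a rolling computation that reacts to each event as it arrives. The following is a toy example (the window size, threshold, and simulated feed are all assumptions): a sliding-window average raises an alert the moment it crosses a threshold, rather than waiting for a batch to complete.

```python
from collections import deque

def rolling_alerts(events, window=3, threshold=100.0):
    # Keep only the last `window` readings -- each event is processed as it
    # arrives, and an insight (here, a threshold alert) is produced immediately.
    recent = deque(maxlen=window)
    for value in events:
        recent.append(value)
        avg = sum(recent) / len(recent)
        if avg > threshold:
            yield avg

# Simulated live feed; a real source would be a message queue or socket.
feed = [90.0, 95.0, 120.0, 130.0, 140.0]
alerts = list(rolling_alerts(feed))
```

Because the generator yields results mid-stream, downstream consumers can act on each alert while earlier events are still being produced, which is the "immediate action" property described above.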
Decisiv uses batch data processing in situations where real-time analytics results are not needed, especially when it is more important to process large volumes of information than to get fast analytics results (although data streams can involve "big" data too; batch processing is not a strict requirement for working with large amounts of data).
The transportation industry is a good example of how field asset monitoring can be put to use. Sensors can be deployed on almost any asset: trucks, buses, taxis, or turnstiles in a subway. Data can be aggregated in a single place about a vehicle's current position, the load weight on a truck, or the number of people on a bus or waiting in a queue for one. Visualizing this data in real time becomes a smart tool that helps traffic dispatchers optimize traffic and expenses.
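One way to support such a dispatcher view is to fold the stream of sensor readings into a live snapshot keyed by vehicle. The sketch below is purely illustrative: the field names and readings are assumptions, and a real system would consume from a message broker rather than a list.

```python
def update_snapshot(snapshot, reading):
    # Each incoming reading overwrites the stored state for its vehicle,
    # so the snapshot always reflects the most recent data per asset.
    snapshot[reading["vehicle_id"]] = {
        "position": reading["position"],
        "load_kg": reading["load_kg"],
    }
    return snapshot

# Simulated sensor feed (hypothetical values).
readings = [
    {"vehicle_id": "truck-1", "position": (40.71, -74.00), "load_kg": 8200},
    {"vehicle_id": "bus-7",   "position": (40.73, -73.99), "load_kg": 0},
    {"vehicle_id": "truck-1", "position": (40.72, -74.01), "load_kg": 8200},
]

snapshot = {}
for r in readings:
    update_snapshot(snapshot, r)
# snapshot now holds only the latest state for truck-1 and bus-7
```

A dashboard would render this snapshot on each refresh; because state is keyed by asset, the structure stays small no matter how many readings stream through.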
In comparison, payroll systems are a good example of batch data processing. Payroll transactions are processed in a recurring, time-limited cycle. The main advantages are the following:
- Payroll batches are repeated jobs that are processed quickly
- No additional hardware or system support is needed to input data
- Both small and large organizations can take advantage of processing payroll in batches
- A single batch system processes payroll for multiple employees
- Repeated work is managed easily, with less idle time
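The advantages above can be seen in a toy payroll run (the employee records and pay figures are invented for illustration): one recurring job computes pay for every employee in a single pass, which is what lets a single batch system serve many employees with little idle time.

```python
# Hypothetical employee records for one pay cycle.
EMPLOYEES = [
    {"name": "Ana", "hours": 160, "rate": 25.0},
    {"name": "Ben", "hours": 152, "rate": 30.0},
]

def run_payroll(employees):
    # One batch run: compute gross pay for the whole group at once.
    return [
        {"name": e["name"], "gross": e["hours"] * e["rate"]}
        for e in employees
    ]

payslips = run_payroll(EMPLOYEES)
```

Scheduling this job on a recurring cycle (for example, monthly) is all that is needed; no per-transaction infrastructure has to stay online between runs.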
Batch and real-time data processing both have advantages and disadvantages. The decision of which data processing system best fits a specific job depends on the types and sources of data and on the processing time needed to get the job done. Each organization needs to develop a strategy for how to process its data sources (batching or streaming) that aligns with overall company goals.