Big Data Processing, Streaming vs Batching
Davi Abdallah
Tech Manager, AI & Data Lakehouse Principal Architect, and Distributed Parallel Data Processing Expert working closely with Data Science, Cloud Engineering, and DevOps teams.
Batch data processing is an efficient way to process high volumes of data as a group of transactions collected over a period of time. Data is collected, processed, and the batch results are then output (Apache Spark, an open-source distributed general-purpose cluster-computing framework, is a batch-processing-driven tool). Batch processing requires separate code for input, processing, and output. By analogy, payroll and billing are similar to batch data processing because they occur in a recurring cycle with a time-limited scope. In contrast, streaming data processing requires continual input, processing, and output of data; the data must be processed within a small time window (in near real time). Radar systems that constantly update location data to feed a dashboard tracking several planes, and bank ATMs, are examples: the data must stream through these systems constantly.
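The three separate stages a batch job requires can be sketched in plain Python. This is a minimal illustration with hypothetical in-memory transaction data, not a production pipeline: input, process, and output are written as distinct functions, mirroring the separate code each stage needs.

```python
def read_batch():
    # Input stage: a real pipeline would read a period's worth of records
    # from files or a database; here we use sample transactions (assumed data).
    return [
        {"account": "A", "amount": 120.0},
        {"account": "B", "amount": 75.5},
        {"account": "A", "amount": 30.0},
    ]

def process_batch(records):
    # Process stage: aggregate the entire collected batch at once.
    totals = {}
    for rec in records:
        totals[rec["account"]] = totals.get(rec["account"], 0.0) + rec["amount"]
    return totals

def write_results(totals):
    # Output stage: emit all batch results in one go.
    for account, total in sorted(totals.items()):
        print(f"{account}: {total:.2f}")

write_results(process_batch(read_batch()))
```

The key property is that nothing is emitted until the whole batch has been collected and processed, which is exactly what makes this pattern unsuitable for near-real-time use cases.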
Organizations generally use batch processing for most data pipelines, but streaming can be used to gain near-real-time insights. This allows much faster reaction times, although it comes with added computing costs. Event processing and operational intelligence use streaming data processing to gain insight into operations by running query analysis against live feeds and event data. "Operational intelligence" is about creating near-real-time analytics and providing visibility across diverse data sources. The goal is to obtain near-real-time insights through continuous analytics so that the organization can take immediate action when important events occur. Contrast this with business intelligence, which entails descriptive or historical analysis of operational data.
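Continuous analytics over a live feed can be sketched as a rolling computation that reacts to each event as it arrives. The following is a toy example (the window size, threshold, and simulated feed are all assumptions): a sliding-window average raises an alert the moment it crosses a threshold, rather than waiting for a batch to complete.

```python
from collections import deque

def rolling_alerts(events, window=3, threshold=100.0):
    # Keep only the last `window` readings -- each event is processed as it
    # arrives, and an insight (here, a threshold alert) is produced immediately.
    recent = deque(maxlen=window)
    for value in events:
        recent.append(value)
        avg = sum(recent) / len(recent)
        if avg > threshold:
            yield avg

# Simulated live feed; a real source would be a message queue or socket.
feed = [90.0, 95.0, 120.0, 130.0, 140.0]
alerts = list(rolling_alerts(feed))
```

Because the generator yields results mid-stream, downstream consumers can act on each alert while earlier events are still being produced, which is the "immediate action" property described above.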
Decisiv uses batch data processing in situations where real-time analytics results are not needed, especially when it is more important to process large volumes of information than to get fast analytics results (although data streams can involve "big" data too; batch processing is not a strict requirement for working with large amounts of data).
The transportation industry is a good example of how field asset monitoring can be put to use. Sensors can be deployed on almost any asset: trucks, buses, taxis, or turnstiles in a subway. Data can be aggregated in a single place about a vehicle's current position, the load weight on a truck, or the number of people on a bus or waiting in a queue for one. Visualizing this data in real time becomes a smart tool that helps traffic dispatchers optimize traffic and expenses.
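One way to support such a dispatcher view is to fold the stream of sensor readings into a live snapshot keyed by vehicle. The sketch below is purely illustrative: the field names and readings are assumptions, and a real system would consume from a message broker rather than a list.

```python
def update_snapshot(snapshot, reading):
    # Each incoming reading overwrites the stored state for its vehicle,
    # so the snapshot always reflects the most recent data per asset.
    snapshot[reading["vehicle_id"]] = {
        "position": reading["position"],
        "load_kg": reading["load_kg"],
    }
    return snapshot

# Simulated sensor feed (hypothetical values).
readings = [
    {"vehicle_id": "truck-1", "position": (40.71, -74.00), "load_kg": 8200},
    {"vehicle_id": "bus-7",   "position": (40.73, -73.99), "load_kg": 0},
    {"vehicle_id": "truck-1", "position": (40.72, -74.01), "load_kg": 8200},
]

snapshot = {}
for r in readings:
    update_snapshot(snapshot, r)
# snapshot now holds only the latest state for truck-1 and bus-7
```

A dashboard would render this snapshot on each refresh; because state is keyed by asset, the structure stays small no matter how many readings stream through.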
In comparison, payroll systems are a good example of batch data processing. Payroll transactions are processed in a recurring, time-limited cycle. The main advantages are the following:
- Payroll batches are repeated jobs that are processed quickly
- No additional hardware or system support is needed to input data
- Both small and large organizations can take advantage of processing payroll in batches
- A single batch system processes payroll for multiple employees
- Repeated work is managed easily, with less idle time
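The advantages above can be seen in a toy payroll run (the employee records and pay figures are invented for illustration): one recurring job computes pay for every employee in a single pass, which is what lets a single batch system serve many employees with little idle time.

```python
# Hypothetical employee records for one pay cycle.
EMPLOYEES = [
    {"name": "Ana", "hours": 160, "rate": 25.0},
    {"name": "Ben", "hours": 152, "rate": 30.0},
]

def run_payroll(employees):
    # One batch run: compute gross pay for the whole group at once.
    return [
        {"name": e["name"], "gross": e["hours"] * e["rate"]}
        for e in employees
    ]

payslips = run_payroll(EMPLOYEES)
```

Scheduling this job on a recurring cycle (for example, monthly) is all that is needed; no per-transaction infrastructure has to stay online between runs.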
Batch and real-time data processing both have advantages and disadvantages. The decision of which data processing system best fits a specific job depends on the types and sources of data and on the processing time needed to get the job done. Each organization needs to develop a strategy for how to process its data sources (batching or streaming) that aligns with overall company goals.