The rhetoric and reality of Big Data Processing

The previous article concluded that Big Data cannot be stored and processed using traditional methods such as a single computer. Processing it requires distributed processing technology such as Azure Synapse and Databricks. Likewise, a robust storage solution is required to hold large volumes of structured, semi-structured and unstructured data.

Technologies offered by cloud computing companies, including AWS, Google and Microsoft, are built for such scenarios. This article discusses how big data is stored and processed in the cloud. We start with a simple definition of cloud computing, then look at how distributed computing works and the types of input data in a distributed computing framework.

According to Microsoft Learn, cloud computing is the delivery of computing services over the internet, also known as the cloud. Computing services include tools like databases, servers, storage, software, analytics and networking. More companies are moving on-premises infrastructure into the cloud to benefit from its agility, reliability, scalability, elasticity, geographical distribution and disaster recovery.

Users can access big data computing services such as distributed computing over the internet on Azure Cloud. With Azure Synapse, companies have a unified platform for integrating data from various sources, a data warehouse for storing data and parallel processing capabilities for analysing all the data in one place. It thus offers an end-to-end capability for collaboration, development and management of your system.

Azure Synapse can be used to explore and prepare big data for business intelligence and machine learning using either its SQL or Spark engine, depending on the use case and the data characteristics (remember the three V's: volume, variety and velocity).

Some storage solutions are built specifically to support big data analytics. They work with different data types, so raw data can be stored, explored and prepared for algorithms to generate insights. For instance, Azure Data Lake Storage Gen2 is designed to support exabytes of data. Therefore, as part of your big data warehouse architecture design, it is crucial to choose the best storage solution for your business problem and use case.

Distributed computing, closely related to parallel processing, is a type of computing architecture in which several processors carry out parts of an operation simultaneously. In the Big Data world, parallel computing technology divides big data into chunks and processes them at the same time. This is typically achieved with distributed processing frameworks such as Apache Spark and Hadoop.
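The divide-and-process idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Spark or Hadoop are implemented: a thread pool stands in for the worker nodes of a cluster, the helper names (split_into_chunks, chunk_total, parallel_total) are hypothetical, and the "work" is a simple sum. It shows the split, process and combine pattern rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Divide a list into at most n_chunks contiguous, roughly equal parts."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def chunk_total(chunk):
    """Work done independently on one chunk (here: a simple sum)."""
    return sum(chunk)


def parallel_total(records, n_workers=4):
    """Sum the records chunk by chunk, then combine the partial results.

    Assumption: the thread pool is a stand-in for cluster nodes.
    """
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_totals = list(pool.map(chunk_total, chunks))
    return sum(partial_totals)


if __name__ == "__main__":
    readings = list(range(1, 1_000_001))
    print(parallel_total(readings))
```

In a real big data platform the chunks would live on different machines, and the framework would handle scheduling, data movement and failures.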

Both the SQL and Spark engines process data in parallel. The degree of parallelism depends on the number of nodes provisioned, and serverless options scale automatically. We will look more into Data Warehouse Units (DWUs) in another article to explore how to balance the cost and performance of your system. In other words, what is the right compute configuration for a big data system?

Data processing can be batch or stream, depending on how data is generated and ingested into your big data platform. Batch data is collected over a period and then processed together. Think about monthly utility bills, where usage over the month is added up to compute total usage and the bill for that month. Such data is processed as part of a batch job.
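The utility bill example could look roughly like the sketch below. The tariff (PRICE_PER_KWH), the function name monthly_bills and the sample readings are all made up for illustration; the point is only that the job runs once over the complete month of data.

```python
from collections import defaultdict

# Hypothetical flat tariff, used only for illustration.
PRICE_PER_KWH = 0.30


def monthly_bills(readings, price_per_kwh=PRICE_PER_KWH):
    """Batch job: total each customer's usage for the month, then price it.

    readings is an iterable of (customer_id, kwh_used) tuples collected
    over the whole billing period.
    """
    usage = defaultdict(float)
    for customer_id, kwh_used in readings:
        usage[customer_id] += kwh_used
    return {
        customer_id: {"kwh": total, "bill": round(total * price_per_kwh, 2)}
        for customer_id, total in usage.items()
    }


if __name__ == "__main__":
    # Made-up readings for two customers over one month.
    march_readings = [("A", 10.0), ("B", 4.0), ("A", 12.0), ("B", 6.0)]
    print(monthly_bills(march_readings))
```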

Stream data is produced continuously and must be processed in real time, as it is generated. An example of stream processing is a Fitbit reporting your heart rate; the device constantly recalculates the figure as new data becomes available every second. To gain a competitive advantage, enterprises use both batch and stream data processing for various parts of the business.
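By contrast with a batch job, a stream processor updates its result every time a new value arrives. The sketch below is a simplified illustration and not how a Fitbit actually computes heart rate: the function name rolling_heart_rate, the window size and the sample values are assumptions chosen for the example.

```python
from collections import deque


def rolling_heart_rate(samples, window=5):
    """Yield an updated average after every new sample arrives.

    samples can be any iterable, including an unbounded generator; only
    the most recent `window` values are kept in memory.
    """
    recent = deque(maxlen=window)
    for bpm in samples:
        recent.append(bpm)
        yield sum(recent) / len(recent)


if __name__ == "__main__":
    # Made-up beats-per-minute readings arriving one at a time.
    incoming = [60, 62, 64, 90, 92]
    for average in rolling_heart_rate(incoming, window=3):
        print(round(average, 1))
```

Because the function never needs the full data set, it works equally well on an endless feed of readings.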

To conclude, parallel computing involves processing big data in batches or in real time as the data arrives, using compute engines in the cloud. Because the cloud provider runs this infrastructure, companies can focus on innovation and on empowering all employees to make data-driven decisions that accelerate business growth through business intelligence platforms. Note also that community editions of big data technologies exist; however, they typically lack the SLAs, security and management tools required for an enterprise analytics platform.

In the next article, we will look at common use cases of big data analytics across industries.

More articles by Aisha Ekundayo, PhD
