The rhetoric and reality of Big Data Processing

Aisha Ekundayo, PhD

Data Analytics Consulting | AI Consultant | Data Product Management

Published: April 27, 2021

The previous article concluded that Big Data cannot be stored and processed using traditional methods such as a single computer. To process Big Data, we need distributed processing technology such as Azure Synapse and Databricks. Likewise, a robust storage solution is required to store large volumes of data comprising structured, semi-structured and unstructured data types.

Technologies offered by cloud computing companies, including AWS, Google and Microsoft, are built for such scenarios. This article discusses how big data is stored and processed in the cloud. We start with a simple definition of cloud computing, then look at how distributed computing works and the types of input data in a distributed computing framework.

According to Microsoft Learn, cloud computing is defined as computing services accessed over the internet, also known as the cloud. Computing services include tools like databases, servers, storage, software, analytics and networking. More companies are moving on-premises infrastructure into the cloud to benefit from its agility, reliability, scalability, elasticity, geographical distribution, and disaster recovery.

Users can access Big Data computing services, such as distributed computing, over the internet on the Azure cloud. With Azure Synapse, companies have a unified platform for integrating data from various sources, a data warehouse for storing data and parallel processing capabilities for analysing all the data in one place. It thus offers an end-to-end capability for collaboration, development, and management of your system.

Azure Synapse can be used to explore and prepare big data for business intelligence and machine learning using SQL or Spark engines, depending on the use-case and the data characteristics (remember the three V’s - volume, variety & velocity).

There are storage solutions with capabilities to support big data analytics. These solutions work with different data types, so the raw data can be stored, explored and prepared for algorithms to generate insights. For instance, Azure Data Lake Storage Gen2 is built to support exabytes of data. Therefore, as part of your big data warehouse architecture design, it is crucial to choose the best storage solution for your business problem and use case.

Distributed computing, commonly used to achieve parallel processing, is a type of computing architecture in which several processors carry out parts of an operation simultaneously. In the Big Data world, parallel computing technology divides big data into chunks and processes them simultaneously. In practice, this is achieved with distributed processing frameworks such as Apache Spark and Hadoop.
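The divide-and-process idea described above can be sketched in a few lines of Python. This is only an illustration on a single machine, not how Spark or Hadoop are implemented: a thread pool stands in for the nodes of a cluster, and the function names (`split_into_chunks`, `parallel_sum`) and the sample data are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Divide a list into at most n_chunks contiguous, roughly equal pieces."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, n_workers=4):
    """Sum each chunk on its own worker, then combine the partial results."""
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker processes one chunk independently of the others.
        partial_sums = list(pool.map(sum, chunks))
    # The final step combines the per-chunk results into one answer.
    return sum(partial_sums)


readings = list(range(1, 1001))  # made-up data: the numbers 1 to 1000
print(parallel_sum(readings))
```

A real distributed engine follows the same split, process and combine pattern, but spreads the chunks across many machines and handles failures, scheduling and data movement for you.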

Both the SQL and Spark engines process data in parallel. The degree of parallelism depends on the number of nodes provisioned, and serverless options are available for automatic scaling. We will look more into Data Warehouse Units (DWUs) in another article to explore how to balance the cost and performance of your system. In other words, what is the right compute configuration for a big data system?

Data processing can be batch or stream, depending on how data is generated and ingested into your big data platform. Batch data is data that has been collected over a period and is available for processing all at once. Think about monthly utility bills, where usage over the month is added up to compute total usage and the bill for that month. Such data is processed as part of a batch job.
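As a rough sketch of the utility bill example, the batch job below takes a full set of daily meter readings and computes one bill per customer per month in a single pass. The tariff, the record layout and the function name `monthly_bills` are assumptions made for illustration.

```python
from collections import defaultdict

RATE_PER_KWH = 0.20  # hypothetical tariff, not a real price


def monthly_bills(readings, rate=RATE_PER_KWH):
    """Compute one bill per (customer, month) from a complete batch of readings.

    readings: iterable of (customer_id, 'YYYY-MM-DD', kwh) tuples.
    """
    usage = defaultdict(float)
    for customer_id, day, kwh in readings:
        month = day[:7]  # 'YYYY-MM'
        usage[(customer_id, month)] += kwh
    # The bill is only calculated once all readings for the batch are in.
    return {key: round(total * rate, 2) for key, total in usage.items()}


# Made-up readings collected over the billing period.
batch = [
    ("c1", "2021-04-01", 10.0),
    ("c1", "2021-04-02", 12.5),
    ("c2", "2021-04-01", 8.0),
    ("c1", "2021-05-01", 9.0),
]
for (customer, month), amount in sorted(monthly_bills(batch).items()):
    print(customer, month, amount)
```

The key property of a batch job is visible here: nothing is computed until the whole set of readings is available.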

Stream data is produced continuously and must be processed in real time, as it is generated. An example of stream processing is your Fitbit reporting your heart rate; the device constantly recalculates the figure as new data becomes available every second. To gain a competitive advantage, enterprises use both batch and stream data processing for various parts of the business.
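By contrast with a batch job, a stream processor updates its result every time a new reading arrives. The sketch below keeps a running average of heart rate readings; it is not how any particular device works, and the function name `running_average` and the sample values are invented for illustration.

```python
def running_average(stream):
    """Yield the updated mean after each new reading arrives."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        # The result is refreshed immediately, without waiting for more data.
        yield total / count


heart_rate_feed = iter([72, 75, 71, 78])  # made-up readings, one per second
for average in running_average(heart_rate_feed):
    print(round(average, 1))
```

Because the function only keeps a count and a total, it never needs to store the full history, which is what makes this style of processing suitable for data that never stops arriving.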

To conclude, parallel computing involves processing big data in batches, or in real time as the data arrives, using compute engines in the cloud. With the cloud handling the infrastructure, companies can focus on innovation and on ensuring all employees are empowered to make data-driven decisions that accelerate business growth through business intelligence platforms. Also, note that there are community editions of big data technologies; however, they do not have the SLAs, security, management tools, etc., required for an enterprise analytics platform.

In the next article, we will look at the common use-cases (i.e. examples) of big data analytics across industries.

Fatai Jimoh, PhD

Data Scientist/ Data Engineer/ Researcher

3 years ago

Interesting and concise.

Navesh Kumar

Energy Trading | Algorithmic Trading

3 years ago

Very interesting and a very important topic that should be looked keenly upon.
