Learning to tackle Big Data problems the right way

We live in a world of exponentially increasing data: data that is growing radically in volume, in velocity and in variety. Today, the ability to source this data, store it and, most of all, make sense of it can get you a job in some of the world's best companies, be it in marketing, finance, supply chain or any other domain, simply because data is a universal need and a fundamental building block of the Data-Information-Knowledge-Wisdom hierarchy.

If we jump back in time, the 1980s and 1990s saw the arrival of Business Intelligence: a domain meant to let enterprises organise their data across business areas, sift through vast amounts of data AND maintain a single version of the truth. It is a domain that is still alive today: in BI courses we learn the ways of ETL with Talend and DataStage, of the Enterprise Data Warehouse with its structured data and staging processes, and, possibly most importantly, the ability to harness data with BI tools like MicroStrategy, Power BI or Qlik. But with exponentially growing Big Data, we find it difficult to store unstructured data in Data Warehouses, which are, by the way, not the ideal solution if you are trying to extract insights with Machine Learning.

Machine Learning requires data, and large amounts of it. A number of studies have shown that sometimes having more data matters more than having the best algorithm: in fact, some competitions have been won by teams that simply had access to MORE, or BETTER QUALITY, data. The same applies to a sub-domain of Machine Learning known as Deep Learning: training a neural network on 1,000 data points versus 10,000,000 data points makes a tremendous difference to its predictive power. Hence, the need was born for data lakes based on HDFS (the Hadoop Distributed File System): to store enormous amounts of data, at scale, in the cloud, and cheaply. Wow! So that's our Big Data problems solved, right? Not quite...
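To make the data lake idea a little more concrete, here is a minimal PySpark sketch of landing a dataset in HDFS as partitioned Parquet files. The paths and column names are hypothetical, and it assumes a Spark installation that can reach an HDFS cluster:

    from pyspark.sql import SparkSession

    # Start a Spark session; on a real cluster this would run on YARN or Kubernetes.
    spark = SparkSession.builder.appName("land-events-in-data-lake").getOrCreate()

    # Hypothetical raw events produced upstream as CSV.
    events = spark.read.csv("hdfs:///raw/events.csv", header=True, inferSchema=True)

    # Store them in the data lake as Parquet, partitioned by date so that
    # later jobs only scan the folders they actually need.
    (events.write
           .mode("overwrite")
           .partitionBy("event_date")
           .parquet("hdfs:///datalake/events"))

    spark.stop()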

Just having access to enormous quantities of data doesn't do the trick: quality of data, through ACID transactions for example (Atomicity, Consistency, Isolation and Durability), becomes critical to data teams running production workloads. Furthermore, data teams need access to this data at increasingly fast speeds. Hence, Cloud Data Warehouses like Snowflake, offering enormous scalability while structuring data in tables, and the Delta Lake proposed by Databricks, including the famous concept of the Data Lakehouse, have now come into vogue. But even with the enormous power offered by Databricks clusters or Snowflake virtual warehouses, we still face many challenges, notably the following (a small sketch after the list illustrates the first two):

  • How do we distribute our data so that computation can run in parallel?
  • How do we bring enormous amounts of data into memory so that we don't have to keep going to disk and the network?
  • How do we orchestrate our processing so that we don't waste compute resources?
  • And how do we do all of this in a way that makes maintenance and support easy?
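As a small illustration of the first two questions, here is a hedged PySpark sketch (the file path and column names are made up for the example) showing how we can explicitly repartition a DataFrame so that the work is spread across the cluster, and persist it in memory so that repeated computations avoid going back to disk and the network:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-and-cache-demo").getOrCreate()

    # Hypothetical sales data already sitting in the data lake.
    sales = spark.read.parquet("hdfs:///datalake/sales")

    # 1. Distribute: repartition by a key so that rows for the same country
    #    end up in the same partition and aggregations can run in parallel.
    sales_by_country = sales.repartition(200, "country")

    # 2. Keep it in memory: persist the repartitioned data so that the two
    #    aggregations below reuse the cached copy instead of re-reading disk.
    sales_by_country.persist(StorageLevel.MEMORY_ONLY)

    revenue_per_country = sales_by_country.groupBy("country").sum("amount")
    orders_per_country = sales_by_country.groupBy("country").count()

    revenue_per_country.show()
    orders_per_country.show()

    sales_by_country.unpersist()
    spark.stop()

The right number of partitions and the right storage level depend entirely on your data volumes and your cluster, which is exactly why these questions are hard.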

With the immense amount of choice we have today in tools, technologies and ideas (compute-intensive versus memory-intensive, for instance), making these choices is not always easy.

We need to go back to basics and learn the building blocks of Hadoop and MapReduce well. These old-school, open-source methods of distributed storage and distributed computation rest on principles that can help us better understand the nature of distributed computing today, and help us tackle Big Data problems. I often tell my students in Big Data, and my colleagues at work, to remember the principles of Hadoop as D.R.O.: Distribution and repartition of data across the machines of a cluster, Replication in order to improve reliability, and Optimisation through co-location of data and processing.
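As a reminder of what those principles look like in practice, here is a minimal sketch of the classic MapReduce word count, written as two Hadoop Streaming scripts in Python (the file names are just a convention for the example). The map script runs on the nodes that hold the data blocks, the intermediate pairs are shuffled and sorted by key, and the reduce script sums the counts:

    # mapper.py -- runs where the data blocks live (co-location of data and processing):
    # emits one (word, 1) pair per word read from standard input.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- receives lines sorted by key, so all counts for the same
    # word arrive consecutively and can be summed in a single pass.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)

    if current_word is not None:
        print(f"{current_word}\t{current_count}")

You would typically submit these scripts with the hadoop-streaming jar, pointing it at an input and an output directory in HDFS; the framework then handles the distribution, replication and shuffling for you.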

Take the example of a word count in Spark, today one of the most in-vogue tools for in-memory Big Data processing. If you do a word count in Spark and look under the hood at the DAG visualisation, you will notice that it is in fact built in two stages: a MAP stage and a REDUCE stage. There is a need to shift from Stage 0 to Stage 1 simply because the reduce operation needs to gather all the values for a given key from the map operation onto the same node before it can process them. This is fundamentally the same as the MapReduce paradigm. Of course, this is a simplified example, but we can use our understanding of the MapReduce paradigm to help us tackle more complex DAG visualisations and Big Data problems in the future.
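To see this for yourself, here is a minimal PySpark word count (the input path is hypothetical). Everything up to reduceByKey can be computed partition by partition on the map side; reduceByKey forces the shuffle that starts the next stage, and that boundary is exactly what shows up between Stage 0 and Stage 1 in the DAG visualisation of the Spark UI:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-dag-demo").getOrCreate()
    sc = spark.sparkContext

    # Stage 0 (the "map" side): read lines, split them into words and emit
    # (word, 1) pairs -- all of this happens partition by partition.
    pairs = (sc.textFile("hdfs:///datalake/books/*.txt")
               .flatMap(lambda line: line.split())
               .map(lambda word: (word, 1)))

    # Stage 1 (the "reduce" side): reduceByKey needs every pair with the same
    # word on the same node, so Spark inserts a shuffle here -- this is the
    # stage boundary you see in the DAG.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    print(counts.take(10))
    spark.stop()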

Consider, for example, the complex DAG graph below, produced while running the Alternating Least Squares algorithm on a Spark cluster.

[Image: DAG visualisation of an Alternating Least Squares job on a Spark cluster]

My point, in essence, is that we need to go back to the building blocks in order to handle Big Data better. We need to look under the hood of Big Data tools and try to reverse-engineer them, all while learning the theory of how these tools came about in the first place.

Thanks for reading to the end; I hope you found this article interesting. Give it a thumbs up or share it if you did.


