Learning to tackle Big Data problems the right way

We live in a world of exponentially increasing data: data that is growing radically in volume, in velocity and in variety. Today, the ability to source this data, store it and, most of all, make sense of it can get you a job in some of the world's best companies, be it in marketing, finance, supply chain or any other domain, simply because data is a universal need and a fundamental building block of the Data-Information-Knowledge-Wisdom hierarchy.

If we jump back in time, the 1980s and 1990s saw the arrival of Business Intelligence: a domain meant to let enterprises organise their data across business areas, sift through vast amounts of data AND maintain a single version of the truth. It is a domain that is still alive today: in BI courses we learn the ways of ETL with Talend and DataStage, of the Enterprise Data Warehouse with its structured data and staging processes, and, possibly most importantly, the ability to harness data with BI tools like MicroStrategy, Power BI or Qlik. But with exponentially growing Big Data, we find it difficult to store unstructured data in Data Warehouses, which are, by the way, not the ideal solution if you are trying to extract insights with Machine Learning.

Machine Learning requires data, and large amounts of it. A number of studies have shown that sometimes having more data matters more than having the best algorithm: in fact, some competitions have been won by teams that simply had access to MORE, or BETTER QUALITY, data. The same applies to a sub-domain of Machine Learning known as Deep Learning: training a neural network on 1,000 data points versus 10,000,000 data points makes a tremendous difference to its predictive power. Hence, the need was born for data lakes based on HDFS (the Hadoop Distributed File System): to store enormous amounts of data, at scale, in the cloud, and cheaply. Wow! So that's our Big Data problems solved, right? Not quite...
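To make the data lake idea a little more concrete, here is a minimal PySpark sketch of landing a dataset in HDFS as partitioned Parquet files. The paths and column names are hypothetical, and it assumes a Spark installation that can reach an HDFS cluster:

    from pyspark.sql import SparkSession

    # Start a Spark session; on a real cluster this would run on YARN or Kubernetes.
    spark = SparkSession.builder.appName("land-events-in-data-lake").getOrCreate()

    # Hypothetical raw events produced upstream as CSV.
    events = spark.read.csv("hdfs:///raw/events.csv", header=True, inferSchema=True)

    # Store them in the data lake as Parquet, partitioned by date so that
    # later jobs only scan the folders they actually need.
    (events.write
           .mode("overwrite")
           .partitionBy("event_date")
           .parquet("hdfs:///datalake/events"))

    spark.stop()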

Just having access to enormous quantities of data doesn't do the trick: quality of data, through ACID transactions for example (Atomicity, Consistency, Isolation and Durability), becomes critical to data teams running production workloads. Furthermore, data teams need access to this data at increasingly fast speeds. Hence, Cloud Data Warehouses like Snowflake, offering enormous scalability while structuring data in tables, and the Delta Lake proposed by Databricks, including the famous concept of the Data Lakehouse, have now come into vogue. But even with the enormous power offered by Databricks clusters or Snowflake virtual warehouses, we still face many challenges, notably the following (a small sketch after the list illustrates the first two):

  • How do we distribute our data so that computation can run in parallel?
  • How do we bring enormous amounts of data into memory so that we don't have to keep going to disk and the network?
  • How do we orchestrate our processing so that we don't waste compute resources?
  • And how do we do all of this in a way that makes maintenance and support easy?
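As a small illustration of the first two questions, here is a hedged PySpark sketch (the file path and column names are made up for the example) showing how we can explicitly repartition a DataFrame so that the work is spread across the cluster, and persist it in memory so that repeated computations avoid going back to disk and the network:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-and-cache-demo").getOrCreate()

    # Hypothetical sales data already sitting in the data lake.
    sales = spark.read.parquet("hdfs:///datalake/sales")

    # 1. Distribute: repartition by a key so that rows for the same country
    #    end up in the same partition and aggregations can run in parallel.
    sales_by_country = sales.repartition(200, "country")

    # 2. Keep it in memory: persist the repartitioned data so that the two
    #    aggregations below reuse the cached copy instead of re-reading disk.
    sales_by_country.persist(StorageLevel.MEMORY_ONLY)

    revenue_per_country = sales_by_country.groupBy("country").sum("amount")
    orders_per_country = sales_by_country.groupBy("country").count()

    revenue_per_country.show()
    orders_per_country.show()

    sales_by_country.unpersist()
    spark.stop()

The right number of partitions and the right storage level depend entirely on your data volumes and your cluster, which is exactly why these questions are hard.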

With the immense amount of choice we have today in tools, technologies and ideas (compute-intensive versus memory-intensive, for instance), making these choices is not always easy.

We need to go back to basics and learn the building blocks of Hadoop and MapReduce well. These old-school, open-source methods of distributed storage and distributed computation rest on principles that can help us better understand the nature of distributed computing today, and help us tackle Big Data problems. I often tell my students in Big Data, and my colleagues at work, to remember the principles of Hadoop as D.R.O.: Distribution and repartition of data across the machines of a cluster, Replication in order to improve reliability, and Optimisation through co-location of data and processing.
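As a reminder of what those principles look like in practice, here is a minimal sketch of the classic MapReduce word count, written as two Hadoop Streaming scripts in Python (the file names are just a convention for the example). The map script runs on the nodes that hold the data blocks, the intermediate pairs are shuffled and sorted by key, and the reduce script sums the counts:

    # mapper.py -- runs where the data blocks live (co-location of data and processing):
    # emits one (word, 1) pair per word read from standard input.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- receives lines sorted by key, so all counts for the same
    # word arrive consecutively and can be summed in a single pass.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)

    if current_word is not None:
        print(f"{current_word}\t{current_count}")

You would typically submit these scripts with the hadoop-streaming jar, pointing it at an input and an output directory in HDFS; the framework then handles the distribution, replication and shuffling for you.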

Take the example of a word count in Spark, today one of the most in-vogue tools for in-memory Big Data processing. If you do a word count in Spark and look under the hood at the DAG visualisation, you will notice that it is in fact built in two stages: a MAP stage and a REDUCE stage. There is a need to shift from Stage 0 to Stage 1 simply because the reduce operation needs to gather all the values for a given key from the map operation onto the same node before it can process them. This is fundamentally the same as the MapReduce paradigm. Of course, this is a simplified example, but we can use our understanding of the MapReduce paradigm to help us tackle more complex DAG visualisations and Big Data problems in the future.
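To see this for yourself, here is a minimal PySpark word count (the input path is hypothetical). Everything up to reduceByKey can be computed partition by partition on the map side; reduceByKey forces the shuffle that starts the next stage, and that boundary is exactly what shows up between Stage 0 and Stage 1 in the DAG visualisation of the Spark UI:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-dag-demo").getOrCreate()
    sc = spark.sparkContext

    # Stage 0 (the "map" side): read lines, split them into words and emit
    # (word, 1) pairs -- all of this happens partition by partition.
    pairs = (sc.textFile("hdfs:///datalake/books/*.txt")
               .flatMap(lambda line: line.split())
               .map(lambda word: (word, 1)))

    # Stage 1 (the "reduce" side): reduceByKey needs every pair with the same
    # word on the same node, so Spark inserts a shuffle here -- this is the
    # stage boundary you see in the DAG.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    print(counts.take(10))
    spark.stop()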

Consider, for example, the complex DAG graph below, produced while running the Alternating Least Squares algorithm on a Spark cluster.

[Image: DAG visualisation of an Alternating Least Squares job on a Spark cluster]

My point, in essence, is that we need to go back to the building blocks in order to handle Big Data better. We need to look under the hood of Big Data tools and try to reverse-engineer them, all while learning the theory of how these tools came about in the first place.

Thanks for reading to the end; I hope you found this article interesting. Give it a thumbs up or share it if you did.


