BigData: The Bigger Picture #2
Welcome back, Learners!
I hope you enjoyed the previous part, where we dove into the problems that Big Data creates for storage and processing.
What are we covering today?
Solving storage issues with Distributed Storage
The data is divided into chunks of a fixed size and stored separately across the machines of a cluster. Now, instead of one machine handling all the storage, we have a whole cluster taking care of it.
This approach also supports Horizontal scaling: to store more data, we just add more machines to the cluster.
Now you might be thinking: what if the data on one machine gets lost or corrupted?
To handle that, we have the concept of Replicas: each data chunk (often referred to as a block) has replicas stored on multiple nodes, so losing one machine does not mean losing the data.
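To make the idea concrete, here is a minimal Python sketch of splitting data into fixed-size blocks and assigning each block to multiple nodes. The block size, node names, and round-robin placement are invented purely for illustration; real systems like HDFS use far more sophisticated placement policies.

```python
# A toy sketch of block splitting and replica placement.
# Sizes, node names, and the round-robin policy are illustrative only.

BLOCK_SIZE = 4          # bytes per block (tiny, just for the demo)
REPLICATION_FACTOR = 3  # copies kept of each block
NODES = ["node-1", "node-2", "node-3", "node-4"]

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Divide the data into fixed-size chunks (the last may be smaller)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks: list[bytes]) -> dict[int, list[str]]:
    """Assign each block to REPLICATION_FACTOR distinct nodes, round-robin."""
    placement = {}
    for idx in range(len(blocks)):
        placement[idx] = [NODES[(idx + r) % len(NODES)]
                          for r in range(REPLICATION_FACTOR)]
    return placement

data = b"big data needs distributed storage"
blocks = split_into_blocks(data, BLOCK_SIZE)
for idx, nodes in place_replicas(blocks).items():
    print(f"block {idx} ({blocks[idx]!r}) -> {nodes}")
```

Because every block lives on three different nodes here, any single machine can fail and each block is still recoverable from its other replicas.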
Solving processing issues with Distributed Processing
I think you might have the answer now. Once storage is dealt with, we can handle processing the same way. Each block stored on a node can be processed independently, in parallel with the other blocks of the same dataset stored on other nodes.
Just imagine multiple blocks being processed in parallel across multiple nodes. The output from the different machines is then collected and aggregated on another node, and either returned as the final result or stored again in a distributed manner.
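Here is a toy Python sketch of that flow, with worker processes standing in for cluster nodes: each one processes its own block independently, and a final step aggregates the partial results. This is only an analogy for how a real framework schedules work, not how Hadoop actually does it.

```python
# Each "block" is processed independently by a worker process
# (standing in for a cluster node); partial results are then aggregated.
from concurrent.futures import ProcessPoolExecutor

def process_block(block: list[int]) -> int:
    """The per-block work: here, just summing the numbers in the block."""
    return sum(block)

if __name__ == "__main__":
    # Four blocks of the same dataset, as if stored on four different nodes.
    blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

    with ProcessPoolExecutor(max_workers=4) as pool:
        partial_results = list(pool.map(process_block, blocks))

    # The aggregation step that would run on a coordinating node.
    print("partial:", partial_results)     # [6, 15, 24, 33]
    print("final:", sum(partial_results))  # 78
```

In this sketch, changing max_workers (and the number of blocks) is the toy analogue of adding or removing nodes to tune the degree of parallelism.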
Here, too, we can change the degree of parallelism by increasing or decreasing the number of nodes in the cluster.
Now a word of caution!
It's not as simple as it seems. There is a lot going on under the hood.
There has to be an entity that manages this entire workflow: a framework that takes care of data replication, final output aggregation, and the other complexities involved.
That's where Hadoop comes into the picture. It's a framework to solve Big Data problems by handling Distributed Storage and Processing.
It provides the MapReduce (MR) programming model for Distributed Processing and the Hadoop Distributed File System (HDFS) for Distributed Storage.
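To get a feel for the MapReduce programming model, here is a single-process Python sketch of its three phases (map, shuffle, reduce) using word count as the task. This only mimics the shape of the model on one machine; in real Hadoop, the framework runs your map and reduce functions across many nodes.

```python
# A single-process sketch of the MapReduce phases: map -> shuffle -> reduce.
from collections import defaultdict

lines = ["big data big ideas", "data beats opinions", "big big data"]

# Map phase: emit (key, value) pairs from each input record.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group all values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
# {'big': 4, 'data': 3, 'ideas': 1, 'beats': 1, 'opinions': 1}
```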
Hadoop 2.0 also introduced YARN, which stands for Yet Another Resource Negotiator. As the name suggests, it handles resource management across the nodes of the cluster.
What is Apache Spark?
Apache Spark is a better alternative to MapReduce: it is faster and much easier to work with. The same task written in Spark is far more concise and readable than the equivalent MapReduce code, which is why Spark has become the industry trend nowadays.
And Spark is storage-agnostic: whether you use HDFS, a local file system (just for testing), or any cloud provider's storage service for distributed storage, you can always leverage the processing power of Apache Spark.
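To make the conciseness claim concrete, here is a minimal PySpark sketch of the same word count as the MapReduce example above. It assumes pyspark is installed and that an input.txt file exists; the file name is just an illustration.

```python
# The same word count, expressed with PySpark. Note how the map, shuffle,
# and reduce phases collapse into a short chain of transformations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("input.txt")               # works with local paths, hdfs://, s3://, ...
      .flatMap(lambda line: line.split())  # map: one word per record
      .map(lambda word: (word, 1))         # map: (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)     # shuffle + reduce in one step
)

print(counts.collect())
spark.stop()
```

The textFile line is also where the storage-agnosticism shows up: point it at a local path for testing, or at an HDFS or cloud storage URI in production, and the processing code stays the same.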
Ending note:
That's it, guys! I have covered a lot more ground today than in my previous post. I hope you find it useful and treat it as a starting point to dig deeper.
Please follow me on LinkedIn for more insights and quick notifications on upcoming parts.
What we will be covering next: the Hadoop Distributed File System (HDFS)
Feel free to reach out to me on LinkedIn/topmate.