BigData: The Bigger Picture #2

Welcome back, Learners!

I hope you enjoyed the previous part, where we dove into the following topics:

  • Real case example of BigData
  • Introduction to BigData
  • The problem statement we are trying to solve
  • Related tools and technologies


What are we covering today?

  1. Solving storage issues demystified
  2. Solving processing issues demystified
  3. A brief introduction to Hadoop and Apache Spark


Solving storage issues with Distributed Storage

The data is divided into multiple chunks of a fixed size, and the chunks are spread across the machines of a cluster. Now, instead of one machine handling all the storage, we have an entire cluster taking care of it.

This also enables horizontal scaling, where we can increase or decrease the number of machines (each often called a node) as per our requirements.

Now you might be thinking: what if the data on one machine gets lost or corrupted?

To handle that, we have the concept of replicas: each data chunk (often referred to as a block) has copies stored on multiple nodes, so losing one node doesn't mean losing the data.
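
To make this concrete, here is a toy sketch in Python of how a file could be split into blocks and each block placed on multiple nodes. The block size, node names, and round-robin placement below are all made up for illustration; a real system like HDFS uses much larger blocks (128 MB by default) and smarter placement policies.

# Toy illustration of block splitting and replica placement.
BLOCK_SIZE = 4          # bytes per block (tiny, just for the demo)
REPLICATION_FACTOR = 2  # number of copies kept of each block
NODES = ["node-1", "node-2", "node-3"]

def split_into_blocks(data, block_size):
    """Divide the data into fixed-size chunks (blocks)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks):
    """Assign each block to REPLICATION_FACTOR distinct nodes, round-robin."""
    placement = {}
    for idx in range(len(blocks)):
        placement[idx] = [NODES[(idx + r) % len(NODES)]
                          for r in range(REPLICATION_FACTOR)]
    return placement

blocks = split_into_blocks(b"hello big data world", BLOCK_SIZE)
print(place_replicas(blocks))
# {0: ['node-1', 'node-2'], 1: ['node-2', 'node-3'], 2: ['node-3', 'node-1'], ...}

If node-1 dies, every block it held still has a copy sitting on another node.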


Solving processing issues with Distributed Processing

I think you might see the answer already. Once storage is dealt with, we can distribute the processing as well: each block stored on a node can be processed independently, in parallel with the other blocks of the same data sitting on other nodes.

Just imagine multiple blocks being processed in parallel across multiple nodes. The output from the different machines gets collected and aggregated on another node, and is then either returned as the final output or stored again in a distributed manner.

Here too, to change the degree of parallelism, we can increase or decrease the number of nodes in the cluster.
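
As a rough sketch of that flow, the snippet below uses a Python multiprocessing pool to stand in for a cluster: each "block" is processed independently in parallel, and the partial results are then aggregated into the final output. Everything here (the blocks, the word-count task) is illustrative, not actual Hadoop or Spark code.

# Each block is processed independently and in parallel,
# then the partial results are merged into one final output.
from multiprocessing import Pool

BLOCKS = [
    "big data is big",       # imagine this block lives on node 1
    "data data everywhere",  # ... this one on node 2
    "big big big data",      # ... and this one on node 3
]

def process_block(block):
    """Count words in a single block, independently of all other blocks."""
    counts = {}
    for word in block.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def aggregate(partials):
    """Merge the per-block results into the final answer."""
    final = {}
    for partial in partials:
        for word, count in partial.items():
            final[word] = final.get(word, 0) + count
    return final

if __name__ == "__main__":
    with Pool(processes=len(BLOCKS)) as pool:
        partial_results = pool.map(process_block, BLOCKS)
    print(aggregate(partial_results))
    # {'big': 5, 'data': 4, 'is': 1, 'everywhere': 1}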

Now a word of caution!

It's not as simple as it seems. There is a lot going on under the hood.

There has to be an entity that manages this entire workflow. There has to be a framework that manages the data replication, final output aggregation, and other complexities involved.

That's where Hadoop comes into the picture. It's a framework to solve Big Data problems by handling Distributed Storage and Processing.

It provides the MapReduce (MR) programming model for Distributed Processing and the Hadoop Distributed File System (HDFS) for Distributed Storage.
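
To get a feel for the MR model, here is a sketch of the classic word-count job written as two tiny scripts in the Hadoop Streaming style (Streaming lets you run MR jobs with any executable that reads stdin and writes stdout; the file names mapper.py and reducer.py are just my choice for this example).

# mapper.py -- the map phase: emit "word<TAB>1" for every word seen.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- the reduce phase: Hadoop sorts the mapper output by key,
# so all counts for a word arrive adjacent and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

You would submit these with the hadoop-streaming jar, pointing its -mapper and -reducer options at the two scripts.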

Hadoop 2.0 also introduced YARN, which stands for Yet Another Resource Negotiator. As the name suggests, it provides resource management and allocation across the nodes of a cluster.

What is Apache Spark?

Apache Spark is a better alternative to MapReduce. It is faster (largely because it can keep intermediate data in memory) and much easier to work with: the same task written in Spark is far more concise and readable than its MapReduce equivalent. That's why Spark has become the industry trend nowadays.
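
To illustrate the conciseness point, here is a minimal PySpark sketch of the same word count that needed two separate scripts in the MapReduce style above (it assumes pyspark is installed, and input.txt is just a placeholder path to some text file):

# The whole word-count job in a handful of lines of PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

counts = (spark.sparkContext.textFile("input.txt")   # read the lines
          .flatMap(lambda line: line.split())        # split lines into words
          .map(lambda word: (word, 1))               # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))          # sum the counts per word

print(counts.collect())
spark.stop()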

As for distributed storage, whether you use HDFS, a local file system (just for testing), or any cloud provider's storage service, you can always leverage the processing power of Apache Spark on top of it.
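
For example, reusing the spark session from the sketch above, the same read call works against any of these backends just by changing the path's URI scheme (all the paths below are placeholders, and cloud access additionally needs the right connector and credentials configured):

# Same API, different storage backends.
df_local = spark.read.text("file:///tmp/sample.txt")        # local file system (testing)
df_hdfs  = spark.read.text("hdfs://namenode:9000/data/in")  # HDFS
df_cloud = spark.read.text("s3a://my-bucket/data/in")       # cloud object storage (e.g. S3)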


Ending note:

That's it, guys! I have covered a lot more ground today compared to my previous post. I hope you find it useful and take it as a starting point to dig deeper.

Please follow me on LinkedIn to get more insights about me and quick notifications on upcoming parts.

What we will be covering next: the Hadoop Distributed File System (HDFS)


Feel free to reach out to me on LinkedIn/topmate.

