BigData: The Bigger Picture #2
Welcome back, Learners!
I hope you enjoyed the previous part, where we dove into the problems that Big Data creates for storage and processing.
What are we covering today?
Solving storage issues with Distributed Storage
The data is divided into chunks of a fixed size and stored separately across the machines of a cluster. Now, instead of one machine handling all the storage, we have a whole cluster taking care of it.
This approach also supports Horizontal scaling: to store more data, we just add more machines to the cluster.
Now you might be thinking: what if the data on one machine gets lost or corrupted?
To handle that, we have the concept of Replicas: each data chunk (often referred to as a block) has replicas stored on multiple nodes, so losing one machine does not mean losing the data.
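To make the idea concrete, here is a minimal Python sketch of splitting data into fixed-size blocks and assigning each block to multiple nodes. The block size, node names, and round-robin placement are invented purely for illustration; real systems like HDFS use far more sophisticated placement policies.

```python
# A toy sketch of block splitting and replica placement.
# Sizes, node names, and the round-robin policy are illustrative only.

BLOCK_SIZE = 4          # bytes per block (tiny, just for the demo)
REPLICATION_FACTOR = 3  # copies kept of each block
NODES = ["node-1", "node-2", "node-3", "node-4"]

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Divide the data into fixed-size chunks (the last may be smaller)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks: list[bytes]) -> dict[int, list[str]]:
    """Assign each block to REPLICATION_FACTOR distinct nodes, round-robin."""
    placement = {}
    for idx in range(len(blocks)):
        placement[idx] = [NODES[(idx + r) % len(NODES)]
                          for r in range(REPLICATION_FACTOR)]
    return placement

data = b"big data needs distributed storage"
blocks = split_into_blocks(data, BLOCK_SIZE)
for idx, nodes in place_replicas(blocks).items():
    print(f"block {idx} ({blocks[idx]!r}) -> {nodes}")
```

Because every block lives on three different nodes here, any single machine can fail and each block is still recoverable from its other replicas.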
Solving processing issues with Distributed Processing
I think you might have the answer now. Once storage is dealt with, we can handle processing the same way. Each block stored on a node can be processed independently, in parallel with the other blocks of the same dataset stored on other nodes.
Just imagine multiple blocks being processed in parallel across multiple nodes. The output from the different machines is then collected and aggregated on another node, and either returned as the final result or stored again in a distributed manner.
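Here is a toy Python sketch of that flow, with worker processes standing in for cluster nodes: each one processes its own block independently, and a final step aggregates the partial results. This is only an analogy for how a real framework schedules work, not how Hadoop actually does it.

```python
# Each "block" is processed independently by a worker process
# (standing in for a cluster node); partial results are then aggregated.
from concurrent.futures import ProcessPoolExecutor

def process_block(block: list[int]) -> int:
    """The per-block work: here, just summing the numbers in the block."""
    return sum(block)

if __name__ == "__main__":
    # Four blocks of the same dataset, as if stored on four different nodes.
    blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

    with ProcessPoolExecutor(max_workers=4) as pool:
        partial_results = list(pool.map(process_block, blocks))

    # The aggregation step that would run on a coordinating node.
    print("partial:", partial_results)     # [6, 15, 24, 33]
    print("final:", sum(partial_results))  # 78
```

In this sketch, changing max_workers (and the number of blocks) is the toy analogue of adding or removing nodes to tune the degree of parallelism.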
Here, too, we can change the degree of parallelism by increasing or decreasing the number of nodes in the cluster.
Now a word of caution!
It's not as simple as it seems. There is a lot going on under the hood.
There has to be an entity that manages this entire workflow: a framework that takes care of data replication, final output aggregation, and the other complexities involved.
That's where Hadoop comes into the picture. It's a framework to solve Big Data problems by handling Distributed Storage and Processing.
It provides the MapReduce (MR) programming model for Distributed Processing and the Hadoop Distributed File System (HDFS) for Distributed Storage.
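To get a feel for the MapReduce programming model, here is a single-process Python sketch of its three phases (map, shuffle, reduce) using word count as the task. This only mimics the shape of the model on one machine; in real Hadoop, the framework runs your map and reduce functions across many nodes.

```python
# A single-process sketch of the MapReduce phases: map -> shuffle -> reduce.
from collections import defaultdict

lines = ["big data big ideas", "data beats opinions", "big big data"]

# Map phase: emit (key, value) pairs from each input record.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group all values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
# {'big': 4, 'data': 3, 'ideas': 1, 'beats': 1, 'opinions': 1}
```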
Hadoop 2.0 also introduced YARN, which stands for Yet Another Resource Negotiator. As the name suggests, it handles resource management across the nodes of the cluster.
What is Apache Spark?
Apache Spark is a better alternative to MapReduce: it is faster and much easier to work with. The same task written in Spark is far more concise and readable than the equivalent MapReduce code, which is why Spark has become the industry trend nowadays.
And Spark is storage-agnostic: whether you use HDFS, a local file system (just for testing), or any cloud provider's storage service for distributed storage, you can always leverage the processing power of Apache Spark.
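To make the conciseness claim concrete, here is a minimal PySpark sketch of the same word count as the MapReduce example above. It assumes pyspark is installed and that an input.txt file exists; the file name is just an illustration.

```python
# The same word count, expressed with PySpark. Note how the map, shuffle,
# and reduce phases collapse into a short chain of transformations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("input.txt")               # works with local paths, hdfs://, s3://, ...
      .flatMap(lambda line: line.split())  # map: one word per record
      .map(lambda word: (word, 1))         # map: (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)     # shuffle + reduce in one step
)

print(counts.collect())
spark.stop()
```

The textFile line is also where the storage-agnosticism shows up: point it at a local path for testing, or at an HDFS or cloud storage URI in production, and the processing code stays the same.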
Ending note:
That's it, guys! I have covered a lot more ground today than in my previous post. I hope you find it useful and treat it as a starting point to dig deeper.
Please follow me on LinkedIn for more insights and quick notifications on upcoming parts.
What we will be covering next: the Hadoop Distributed File System (HDFS)
Feel free to reach out to me on LinkedIn/topmate.