How to Make Hadoop Cluster via Amazon EMR? - NareshIT
Naresh i Technologies
In this article, we discuss how to create a Hadoop cluster through AWS EMR and how easy AWS makes it to run Hadoop and Big Data applications and scale them as needed. The article covers the points below and ends with a demo on creating an EMR cluster in the AWS environment. Naresh I Technologies is the number one computer training institute in Hyderabad, and it's among the top five computer training institutes in India. Contact us now for your complete AWS training.
- How to Make Hadoop Cluster through Amazon EMR?
When we search for something on Google, the response arrives in less than a second. How does this happen so quickly? The search engine crawls the web, downloads the pages, and builds an index. Whatever query we make, the search engine searches through that index.
The PageRank algorithm then ranks the pages, so the most popular page is shown at the top and the least popular at the bottom.
The number of web pages runs into the trillions, so indexing and page ranking are quite challenging tasks. That's why Yahoo started using Hadoop. Hadoop later became free and open-source software (FOSS) under the Apache Software Foundation. Many other companies then started to use Hadoop and enrich it. This led to the Big Data revolution and to the evolution of software like Spark, Sqoop, Hive, ZooKeeper, Cassandra, HBase, and Flume, which also compensated for the limitations of Hadoop.
The first use of Hadoop was by web search engines, though numerous other use cases evolved as data volumes exploded. One example is recommendations. Suppose user 1 has book 1 and book 2, user 2 has book 1, book 2, and book 4, and user 3 has book 5, book 6, and book 7. Users 1 and 2 share books 1 and 2, so book 4, which user 2 owns, is a possible recommendation for user 1. Likewise, books 1, 2, and 4 can be recommended by users 1 and 2 to user 3. This is what we know as collaborative filtering, a kind of machine learning algorithm.
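The book example can be sketched in a few lines of Python. This is only a minimal illustration, not a production recommender: the `recommend` function and the `library` dictionary are hypothetical names. It also assumes that a user who shares no books with anyone simply gets the books the other users own most often.

```python
from collections import Counter

# Ownership data from the example (user -> set of books).
library = {
    "user1": {"book1", "book2"},
    "user2": {"book1", "book2", "book4"},
    "user3": {"book5", "book6", "book7"},
}


def recommend(target, library):
    """Suggest books the target does not own yet, best candidates first."""
    owned = library[target]
    overlap_scores = Counter()  # weighted by how many books the users share
    popularity = Counter()      # fallback: how many other users own the book
    for other, books in library.items():
        if other == target:
            continue
        overlap = len(owned & books)
        for book in books - owned:
            popularity[book] += 1
            if overlap:
                overlap_scores[book] += overlap
    # Assumption: with no overlapping users, fall back to plain popularity.
    scores = overlap_scores or popularity
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda book: (-scores[book], book))


print(recommend("user1", library))  # ['book4']
print(recommend("user3", library))  # ['book1', 'book2', 'book4']
```

With three users this is easy to check by hand. With millions of users and books, the same counting has to be distributed, which is where tools like Hadoop come in.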
Each page has a list of users and books, and there are indexes for each "user" and "book." The rank of a page is based on the recommendations for that page. A user on one page can recommend, to a user on another page, a book that the second user doesn't have.
Here the data size is small, so we can visualize the data and come up with some results. However, as the data size increases, visualization gets out of control, and we need big data tools such as Hadoop.
We can solve such problems through Hadoop. However, it is not easy to install Hadoop and the various other Big Data software. Installing, integrating, and configuring them for a complete implementation takes a lot of work. This is where vendors such as Cloudera, Databricks, and MapR come in. Through them, we install Big Data software, and they provide technical support as well, for example when something goes wrong in production. Amazon EMR makes the use of Hadoop much simpler. EMR also supports Resilient Distributed Datasets (the Spark data model), not only MapReduce.
Let's see how we can set up an EMR cluster, and thus a Hadoop cluster, on the AWS Cloud.
- Demo
Step 1: Go to the EMR management console and click "Create Cluster." Within the console, the metadata of terminated clusters is retained for around two months for free. Thus, you can clone a terminated cluster and create it again.
Step 2: On the quick options screen, select "Go to advanced options" to specify more details about the cluster.
Step 3: In the Advanced Options tab, you can select the software to install on the EMR cluster. For a SQL interface, select Hive. For a data flow language, select Pig. For distributed application coordination, select ZooKeeper, and so on. In this tab, we can also add optional steps, which are Big Data processing jobs run with MapReduce, Hive, Pig, and more. We can add these steps here or after creating the cluster. Click "Next" to select the hardware for the EMR cluster.
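When a cluster is created from code instead of the console, the software selection in Step 3 corresponds to the `Applications` parameter of boto3's `run_job_flow` call. The sketch below only builds that list and does not contact AWS. `build_applications` is a hypothetical helper, and the application names are assumed to be available in the chosen EMR release.

```python
def build_applications(names):
    """Return the Applications list in the shape run_job_flow expects."""
    unique = []
    for name in names:
        if name not in unique:  # drop duplicates while keeping the order
            unique.append(name)
    return [{"Name": name} for name in unique]


# Hive for the SQL interface, Pig for data flow scripts,
# ZooKeeper for distributed coordination.
applications = build_applications(["Hadoop", "Hive", "Pig", "ZooKeeper", "Hive"])
print(applications)
```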
Step 4: Hadoop uses a master-worker architecture. The master handles all coordination tasks, such as scheduling, work assignment, and progress checks. The workers do the actual processing and store the data. The master can therefore be a single point of failure (SPOF). However, Amazon EMR supports multiple master nodes to ensure high availability, and the previous step lets us enable multi-master for the EMR cluster.
EMR supports two kinds of worker nodes, known as core and task nodes. Core nodes process and store data. Task nodes only process data. In this demo, we select one core node and no task nodes to keep the cost low. We also choose spot instances rather than on-demand instances, as they are cheaper. Remember that AWS can terminate spot instances automatically, because they have lower priority than other instance types, though it gives a two-minute notice before doing so. That is acceptable here, as this is just a demo and an introductory scenario. Now click "Next."
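The hardware choice from this step can also be written as data. The sketch below builds an `InstanceGroups` list in the shape boto3's `run_job_flow` accepts: one master group, one core group, and no task group. `build_instance_groups` is a hypothetical helper. The instance type `m5.xlarge` and putting the master on spot as well are assumptions, not details from the demo.

```python
def build_instance_groups(instance_type="m5.xlarge", core_count=1, use_spot=True):
    """Describe one master group and one core group; no task group, as in the demo."""
    market = "SPOT" if use_spot else "ON_DEMAND"
    return [
        {
            "Name": "Master",
            "InstanceRole": "MASTER",  # coordinates scheduling and progress checks
            "Market": market,
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",  # processes and stores data
            "Market": market,
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]


groups = build_instance_groups()
for group in groups:
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

Passing `use_spot=False` switches both groups to on-demand, which may be the safer choice once the cluster holds data you cannot afford to lose.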
Step 5: Now enter the cluster name, and then click "Next." You will find that "Termination protection" is on by default. It ensures that the EMR cluster is not deleted accidentally, as a few extra steps are required before the cluster can be terminated.
Step 6: This tab offers various security options for EMR. We need to select the key pair for logging in to the EC2 instances. EMR creates the appropriate roles and security groups automatically and attaches them to the master and worker nodes. Now click "Create cluster."
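For readers who prefer code to the console, the choices from Steps 1 to 6 can be gathered into one request for boto3's `run_job_flow`. This is a sketch under assumptions, not the exact configuration used in the demo. `build_cluster_request` is a hypothetical helper. The cluster name, key pair name, release label, and instance type are placeholders. `EMR_DefaultRole` and `EMR_EC2_DefaultRole` are the default EMR role names and are assumed to exist in the account.

```python
def build_cluster_request(name, key_pair, release_label):
    """Assemble keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER", "Market": "SPOT",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE", "Market": "SPOT",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
            ],
            "Ec2KeyName": key_pair,               # key pair for SSH login
            "KeepJobFlowAliveWhenNoSteps": True,  # stay in "Waiting" when idle
            "TerminationProtected": True,         # matches the console default
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # role for the EC2 instances
        "ServiceRole": "EMR_DefaultRole",      # role for the EMR service
    }


# Placeholder values; replace them with your own before sending the request.
request = build_cluster_request("demo-cluster", "my-key-pair", "emr-6.15.0")
print(sorted(request))
```

With AWS credentials configured, the request could then be sent with `boto3.client("emr").run_job_flow(**request)`, whose response contains the new cluster's `JobFlowId`.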
Cluster creation takes a few minutes, as the EC2 instances need to be provisioned and the various Big Data software needs to be installed and configured. At first, the cluster status is "Starting," and then it shifts to "Waiting." When the EMR cluster is in the "Waiting" state, it waits for the submission of Big Data processing jobs such as MapReduce, Hive, Spark, and more.
Now, in the EC2 management console, check whether the master and worker instances are running. You can also see them in the Hardware tab of the EMR console, which shows the price of these instances as well. The price keeps changing, as is usual for spot instances.
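The move from "Starting" to "Waiting" can also be watched from code by polling the cluster state. The sketch below takes any function with the same shape as boto3's `describe_cluster`, so it runs offline against a stand-in. `wait_until_waiting`, `fake_describe_cluster`, and the cluster ID `j-EXAMPLE` are hypothetical names used only for illustration.

```python
import time

# States from which the cluster will never reach "WAITING".
FAILED_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_waiting(describe_cluster, cluster_id, poll_seconds=30, max_polls=60):
    """Poll the cluster state until it is WAITING; return the states observed."""
    seen = []
    for _ in range(max_polls):
        response = describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        seen.append(state)
        if state == "WAITING":
            return seen
        if state in FAILED_STATES:
            raise RuntimeError(f"cluster {cluster_id} ended in state {state}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"cluster {cluster_id} did not reach WAITING in time")


# Offline stand-in for the real API call, so the sketch runs without AWS.
_fake_states = iter(["STARTING", "BOOTSTRAPPING", "WAITING"])


def fake_describe_cluster(ClusterId):
    return {"Cluster": {"Status": {"State": next(_fake_states)}}}


history = wait_until_waiting(fake_describe_cluster, "j-EXAMPLE", poll_seconds=0)
print(history)  # ['STARTING', 'BOOTSTRAPPING', 'WAITING']
```

Against a real cluster, `boto3.client("emr").describe_cluster` can be passed in place of the stand-in. boto3 also ships a built-in `cluster_running` waiter that serves a similar purpose.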
Step 7: Thus, we have successfully created the EMR cluster. Steps are the Big Data processing jobs, and you can add any number of them. For now, click "Cancel"; we will discuss steps in a separate blog.
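To give a rough idea of what a step looks like outside the console, the sketch below builds a step definition that runs a Hive script through `command-runner.jar`, in the shape boto3's `add_job_flow_steps` accepts. `build_hive_step` is a hypothetical helper and the S3 path is a placeholder. It assumes Hive was selected when the cluster was created.

```python
def build_hive_step(name, script_s3_uri):
    """Describe one EMR step that runs a Hive script stored in S3."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", script_s3_uri],
        },
    }


# Placeholder script location; point this at your own bucket.
step = build_hive_step("demo-hive-step", "s3://my-bucket/scripts/query.q")
print(step["HadoopJarStep"]["Args"])
```

A step like this could be submitted to a cluster in the "Waiting" state with `add_job_flow_steps(JobFlowId=..., Steps=[step])`.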
Step 8: Now that the cluster has started, let's see how to stop it.
Step 8.1: Click "Terminate."
Step 8.2: Termination protection is currently on for the EMR cluster, so the Terminate button is disabled. Click "Change."
Step 8.3: Select the "Off" radio button and click the check mark. The Terminate button is now enabled. This extra step ensures that we don't terminate the EMR cluster by accident.
Note that the EMR cluster moves to the "Terminating" status, and the EC2 instances are terminated. Ultimately, the EMR cluster changes to the "Terminated" status, which ensures that AWS billing for it stops. Make sure the cluster terminates, or you will incur costs.
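The same two-stage shutdown, switching termination protection off and then terminating, looks like this when scripted. The sketch calls the boto3 EMR client methods `set_termination_protection` and `terminate_job_flows`. To stay runnable without AWS, it uses a small recording stand-in for the client. `RecordingClient`, `terminate_cluster`, and the cluster ID `j-EXAMPLE` are hypothetical.

```python
class RecordingClient:
    """Stand-in for boto3.client("emr") that records calls instead of calling AWS."""

    def __init__(self):
        self.calls = []

    def set_termination_protection(self, JobFlowIds, TerminationProtected):
        self.calls.append(
            ("set_termination_protection", tuple(JobFlowIds), TerminationProtected)
        )

    def terminate_job_flows(self, JobFlowIds):
        self.calls.append(("terminate_job_flows", tuple(JobFlowIds)))


def terminate_cluster(emr_client, cluster_id):
    """Turn termination protection off, then terminate the cluster."""
    # Terminating a protected cluster is rejected, so protection goes first.
    emr_client.set_termination_protection(
        JobFlowIds=[cluster_id], TerminationProtected=False
    )
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])


client = RecordingClient()
terminate_cluster(client, "j-EXAMPLE")
for call in client.calls:
    print(call)
```

With real credentials, passing `boto3.client("emr")` and an actual cluster ID to `terminate_cluster` would perform the shutdown described above.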
All of the above is also possible with the AWS CLI, AWS CloudFormation, and the AWS SDK. You can thus set up an EMR cluster within a few minutes and start Big Data processing in no time. After processing, the results are saved in S3 or DynamoDB, and billing stops as the cluster shuts down. Because of this pricing model and ease of use, EMR is now the first preference of many. You don't need to buy large servers and Big Data software licenses anymore, and you don't need to maintain them either.
Naresh I Technologies provides one of the best AWS training programs.
Contact us anytime for your AWS training from one of the best AWS training institutes in India.
Follow us for More Updates: https://bit.ly/NITLinkedIN