Launch a Hadoop Cluster in 90 Seconds or Less with Google Cloud Dataproc!
Kimoon Kim
“I meant what I said and I said what I meant. An elephant’s faithful one-hundred percent!” — Dr. Seuss
Hadoop, our favourite elephant, is an open-source framework that allows you to store and analyse big data across clusters of computers. It achieves this using Google’s MapReduce programming model, which you can learn more about here.
Now launching an on-premises Hadoop cluster is not an easy job. I have been fortunate enough to be involved in the process and know the effort it takes to build one. As a rough rule of thumb, it takes about a day and a half per node to get a working cluster up, so a 10-node cluster usually takes around 15 working days. Who has 15 days to build a 10-node cluster?
Lately in South Africa, there has been a lot of interest in Hadoop among corporates and big data startups. That is why I thought it was a good time to write a step-by-step article showing how easy it is to launch a Hadoop cluster using Google Cloud Dataproc. Cloud Dataproc is Google’s fully managed Hadoop, Spark, and Flink service that lets you deploy clusters in a simpler, more cost-efficient way. Let me show you how.
What you need
- A Google Cloud project with billing enabled (you can learn how to set that up here).
- Write down your public IP address by typing “What is my IP” into Google (or use the terminal sketch below).
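If you prefer a terminal, you can also fetch your public IP from the command line. The sketch below uses ifconfig.me, one of several public services that simply echo your address back; any equivalent service works:

```
# Print your current public IP address.
curl -s https://ifconfig.me
```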
1. Network Setup
Step 1
Go to the dashboard of the project you created on console.cloud.google.com.
Step 2
Click on the “hamburger” Menu at the top left-hand side of your screen.
Step 3
Under NETWORKING, hover over Network Services and click on the Firewall rules button.
Step 4
Then click Create Firewall Rule.
Step 5
Type in the name you want to give your firewall rule (I used hadoop-firewall).
Step 6
- Change the Targets from “Specified target tags” to “All instances in the network”.
- Under Source IP ranges, enter the public IP address you noted down at the beginning.
- Under Protocols and ports, type tcp:8088;tcp:50070;tcp:8080 to open the specific ports (8088 serves the YARN ResourceManager web UI and 50070 the HDFS NameNode web UI). A gcloud equivalent is sketched after this list.
- Click Create.
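For readers who prefer the command line, here is a minimal gcloud sketch of the same rule. The rule name hadoop-firewall matches the one used later in this article, and 1.2.3.4 is a placeholder for your own public IP:

```
# Allow your own IP to reach the Hadoop web UIs on the default network.
# 8088 = YARN ResourceManager UI, 50070 = HDFS NameNode UI.
gcloud compute firewall-rules create hadoop-firewall \
    --network=default \
    --direction=INGRESS \
    --allow=tcp:8088,tcp:50070,tcp:8080 \
    --source-ranges=1.2.3.4/32  # replace with your public IP
```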
2. Dataproc Setup
Step 7
Click on the “Hamburger-stack” or Menu and navigate to Dataproc under the BIG DATA section.
Step 8
Click on the Create cluster button.
Step 9
- Give the cluster a name.
- Set the Machine type and Primary disk size for your Master node (e.g., n1-standard-4 with a 10 GB disk).
- Set the Machine type for your Worker nodes and specify how many you need (e.g., two n1-standard-4 nodes, each with a 10 GB disk).
- You can go into more detail by expanding the “Preemptible workers, bucket, network, version, initialization, & access options” link.
- (Optional) You can create a Hadoop cluster under a different network. For the purpose of this article, I created a hadoop-firewall rule under the default network which this cluster sits under.
- Then click on the Create button to create the cluster when you are done. (A scripted equivalent is sketched below.)
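If you would rather script this step, here is a hedged gcloud equivalent. The cluster name my-hadoop-cluster and the europe-west1 region are assumptions for illustration, and newer Dataproc image versions may enforce a larger minimum boot disk than the 10 GB used here:

```
# One n1-standard-4 master and two n1-standard-4 workers,
# each with a 10 GB primary boot disk.
gcloud dataproc clusters create my-hadoop-cluster \
    --region=europe-west1 \
    --master-machine-type=n1-standard-4 \
    --master-boot-disk-size=10GB \
    --num-workers=2 \
    --worker-machine-type=n1-standard-4 \
    --worker-boot-disk-size=10GB
```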
Step 10
Wait (around 90 seconds) until your cluster is created. You’ll know it is done when you see the green tick.
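You can also poll the provisioning state from a terminal; this sketch assumes the placeholder cluster name and region from the sketch above:

```
# Prints RUNNING once the cluster is fully provisioned.
gcloud dataproc clusters describe my-hadoop-cluster \
    --region=europe-west1 \
    --format='value(status.state)'
```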
3. Test your Hadoop cluster
Step 11
Click on Compute Engine under the COMPUTE section in your menu.
Step 12
Note down your Master node’s External IP address (the master instance is named after your cluster, followed by -m).
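The same address can be fetched with gcloud. The instance name below assumes the placeholder cluster name used earlier, with the -m suffix mentioned above, and europe-west1-b is a placeholder zone:

```
# Print the master node's external IP address.
gcloud compute instances describe my-hadoop-cluster-m \
    --zone=europe-west1-b \
    --format='get(networkInterfaces[0].accessConfigs[0].natIP)'
```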
Step 13
In your browser, type the External IP followed by a colon and the port (e.g., 35.195.107.25:8088 in my case) and you should see the YARN ResourceManager web UI.
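A quick terminal check works too; 35.195.107.25 is the example address from this article, so substitute your own master IP:

```
# HTTP 200 means the YARN ResourceManager UI is reachable
# and your firewall rule is working.
curl -s -o /dev/null -w '%{http_code}\n' http://35.195.107.25:8088
```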
Congratulations! See how simple it is?
If you want to know more about Hadoop on Dataproc, please go read Tino Tereshko’s great article on Why Dataproc — Google’s managed Hadoop and Spark offering is a game changer.
Follow me on Twitter at @kimoon92 and let me know what you think!
-Kimoon Kim
"Over Qualified Professional Quantity Surveyor | Maximizing Value and Minimizing Costs for Your Projects"
3 年Kimoon, thanks for sharing!
Co-founder at Fixa | Generation17 Young Leader
6 年We need this in Zimbabwe. Is it possible to organize an event?