Launch a Hadoop Cluster in 90 Seconds or Less with Google Cloud Dataproc!
Kimoon Kim
“I meant what I said and I said what I meant. An elephant’s faithful one-hundred percent!” — Dr. Seuss
Hadoop, our favourite elephant, is an open-source framework that allows you to store and analyse big data across clusters of computers. It achieves this using Google’s MapReduce programming model, which you can learn more about here.
Now launching an on-premises Hadoop cluster is not an easy job. I have been fortunate enough to be involved in the process and know the effort it takes to build one. As a rough rule of thumb, it takes about a day and a half per node to get a working cluster up, so a 10-node cluster usually takes around 15 working days. Who has 15 days to build a 10-node cluster?
Lately in South Africa, there has been a lot of interest in Hadoop among corporates and big data startups. That is why I thought it was a good time to write a step-by-step article showing how easy it is to launch a Hadoop cluster using Google Cloud Dataproc. Cloud Dataproc is Google’s fully managed Hadoop, Spark, and Flink service that lets you deploy clusters in a simpler, more cost-efficient way. Let me show you how.
What you need
- A Google Cloud project with billing enabled (you can learn how to set that up here).
- Write down your public IP address by typing “What is my IP” into Google (or use the terminal sketch below).
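If you prefer a terminal, you can also fetch your public IP from the command line. The sketch below uses ifconfig.me, one of several public services that simply echo your address back; any equivalent service works:

```
# Print your current public IP address.
curl -s https://ifconfig.me
```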
1. Network Setup
Step 1
Go to the dashboard of the project you created on console.cloud.google.com.
Step 2
Click on the “hamburger” Menu at the top left-hand side of your screen.
Step 3
Under NETWORKING, hover over Network Services and click on the Firewall rules button.
Step 4
Then click Create Firewall Rule.
Step 5
Type in the name you want to give your firewall rule (I used hadoop-firewall).
Step 6
- Change the Targets from “Specified target tags” to “All instances in the network”.
- Under Source IP ranges, enter the public IP address you noted down at the beginning.
- Under Protocols and ports, type tcp:8088;tcp:50070;tcp:8080 to open the specific ports (8088 serves the YARN ResourceManager web UI and 50070 the HDFS NameNode web UI). A gcloud equivalent is sketched after this list.
- Click Create.
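For readers who prefer the command line, here is a minimal gcloud sketch of the same rule. The rule name hadoop-firewall matches the one used later in this article, and 1.2.3.4 is a placeholder for your own public IP:

```
# Allow your own IP to reach the Hadoop web UIs on the default network.
# 8088 = YARN ResourceManager UI, 50070 = HDFS NameNode UI.
gcloud compute firewall-rules create hadoop-firewall \
    --network=default \
    --direction=INGRESS \
    --allow=tcp:8088,tcp:50070,tcp:8080 \
    --source-ranges=1.2.3.4/32  # replace with your public IP
```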
2. Dataproc Setup
Step 7
Click on the “Hamburger-stack” or Menu and navigate to Dataproc under the BIG DATA section.
Step 8
Click on the Create cluster button.
Step 9
- Give the cluster a name.
- Set the Machine type and Primary disk size for your Master node (e.g., n1-standard-4 with a 10 GB disk).
- Set the Machine type for your Worker nodes and specify how many you need (e.g., two n1-standard-4 nodes, each with a 10 GB disk).
- You can go into more detail by expanding the “Preemptible workers, bucket, network, version, initialization, & access options” link.
- (Optional) You can create a Hadoop cluster under a different network. For the purpose of this article, I created a hadoop-firewall rule under the default network which this cluster sits under.
- Then click on the Create button to create the cluster when you are done. (A scripted equivalent is sketched below.)
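If you would rather script this step, here is a hedged gcloud equivalent. The cluster name my-hadoop-cluster and the europe-west1 region are assumptions for illustration, and newer Dataproc image versions may enforce a larger minimum boot disk than the 10 GB used here:

```
# One n1-standard-4 master and two n1-standard-4 workers,
# each with a 10 GB primary boot disk.
gcloud dataproc clusters create my-hadoop-cluster \
    --region=europe-west1 \
    --master-machine-type=n1-standard-4 \
    --master-boot-disk-size=10GB \
    --num-workers=2 \
    --worker-machine-type=n1-standard-4 \
    --worker-boot-disk-size=10GB
```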
Step 10
Wait (around 90 seconds) until your cluster is created. You’ll know it is done when you see the green tick.
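You can also poll the provisioning state from a terminal; this sketch assumes the placeholder cluster name and region from the sketch above:

```
# Prints RUNNING once the cluster is fully provisioned.
gcloud dataproc clusters describe my-hadoop-cluster \
    --region=europe-west1 \
    --format='value(status.state)'
```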
3. Test your Hadoop cluster
Step 11
Click on Compute Engine under the COMPUTE section in your menu.
Step 12
Note down your Master node’s External IP address (the master instance is named after your cluster, followed by -m).
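The same address can be fetched with gcloud. The instance name below assumes the placeholder cluster name used earlier, with the -m suffix mentioned above, and europe-west1-b is a placeholder zone:

```
# Print the master node's external IP address.
gcloud compute instances describe my-hadoop-cluster-m \
    --zone=europe-west1-b \
    --format='get(networkInterfaces[0].accessConfigs[0].natIP)'
```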
Step 13
In your browser, type the External IP followed by a colon and the port (e.g., 35.195.107.25:8088 in my case) and you should see the YARN ResourceManager web UI.
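A quick terminal check works too; 35.195.107.25 is the example address from this article, so substitute your own master IP:

```
# HTTP 200 means the YARN ResourceManager UI is reachable
# and your firewall rule is working.
curl -s -o /dev/null -w '%{http_code}\n' http://35.195.107.25:8088
```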
Congratulations! See how simple it is?
If you want to know more about Hadoop on Dataproc, please go read Tino Tereshko’s great article on Why Dataproc — Google’s managed Hadoop and Spark offering is a game changer.
Follow me on Twitter at @kimoon92 and let me know what you think!
-Kimoon Kim
"Over Qualified Professional Quantity Surveyor | Maximizing Value and Minimizing Costs for Your Projects"
3 年Kimoon, thanks for sharing!
Co-founder at Fixa | Generation17 Young Leader
6 年We need this in Zimbabwe. Is it possible to organize an event?