Spark - Setting up a Dataproc Cluster on GCP

Dataproc is Google's managed cloud service for running Spark and other data processing tools such as Flink and Presto.

Dataproc can be used for data lake modernization, ETL, and secure data science at planet scale, fully integrated with Google Cloud.

In this article I'm going to cover how to set up a Dataproc cluster on Google Cloud Platform.

Creating a Cluster

  1. Go to the Google Cloud Platform console.
  2. In the search bar, type Dataproc and click on the first result.


3. The first time you use Dataproc you'll have to enable the Cloud Dataproc API (a CLI alternative is sketched just below).

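If you prefer the command line, the same API can be enabled with gcloud (a quick sketch, assuming the gcloud CLI is installed and pointed at your project):

# Enable the Dataproc API for the current project
gcloud services enable dataproc.googleapis.com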

4. After that, click on the Create Cluster button.


5. Define the properties needed to create the cluster, such as the Name, Location and Cluster Type. It's good practice to pick the same Location that was used to set up the Google Cloud Storage bucket.


6. Optionally, you may also install additional components, but we won't cover them in this article.


7. Click on the Create button (an equivalent gcloud command is sketched right after this list).

8. Wait a few moments while the required infrastructure is provisioned: in the background, GCP creates a virtual machine to support this cluster. If you go to Compute Engine -> VM instances in the left panel, you can see that a new VM instance was created.

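For reference, a similar cluster can also be created from the command line instead of the console. A minimal sketch, reusing the cluster name and region that appear later in this article (the --single-node flag is my assumption for a small test cluster):

# Create a minimal single-node Dataproc cluster
gcloud dataproc clusters create de-zoomcamp-cluster \
    --region=us-central1 \
    --single-node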

Submitting a job through the web console

  1. Once the status changes to Running, click on the cluster name and then on the Submit job button.


2. Change the Job type to PySpark.


3. Enter the path of the code you want to execute. The easiest way is to upload the script to a Google Cloud Storage bucket.


4. To refer to a file stored in a GCS bucket, use the following structure:

gs://{storage_bucket_name}/{path_inside_bucket}

Here is what I entered in this case:

gs://dtc_data_lake_taxi-rides-ny-348613/code/06_spark_sql.py
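If the script only exists on your machine, you can copy it into the bucket first with gsutil (assuming the file name and bucket from the example above):

# Upload the local PySpark script to the bucket's code/ folder
gsutil cp 06_spark_sql.py gs://dtc_data_lake_taxi-rides-ny-348613/code/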

5. If you need to pass arguments to your code, specify them in the Arguments form (inside the script they are typically read with a standard argument parser such as Python's argparse).

6. Click on the Submit button.

Submitting a job using gcloud

According to the Google Cloud documentation, the command has the following structure:

gcloud dataproc jobs submit job-command \
    --cluster=cluster-name \
    --region=region \
    other dataproc-flags \
    -- job-args

Adjusting this structure to my example, it becomes something like this:

gcloud dataproc jobs submit pyspark \
    --cluster=de-zoomcamp-cluster \
    --region=us-central1 \
    gs://dtc_data_lake_taxi-rides-ny-348613/code/06_spark_sql.py \
    -- \
        --input_green=gs://dtc_data_lake_taxi-rides-ny-348613/pq/green/2021/*/ \
        --input_yellow=gs://dtc_data_lake_taxi-rides-ny-348613/pq/yellow/2021/*/ \
        --output=gs://dtc_data_lake_taxi-rides-ny-348613/report-2021
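Once the job is submitted, its status can be checked from the same terminal (a sketch using the cluster and region from the example above):

# List jobs on the cluster to check their status
gcloud dataproc jobs list \
    --cluster=de-zoomcamp-cluster \
    --region=us-central1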

I'm using another VM instance that I created previously on my GCP account to run this command.
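One way to point gcloud at a specific service account on such a VM is to activate it with a key file (a sketch; the key path is a placeholder, and this may not be how the VM was originally configured):

# Authenticate gcloud with a service account JSON key (placeholder path)
gcloud auth activate-service-account --key-file=/path/to/key.json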


When I try to run this command, I get a permission error. That happens because I'm using the same service account that I used for setting up other services (e.g. Terraform, Airflow, BigQuery), and this account doesn't have permission to submit jobs to Dataproc.

To change that, search for IAM & Admin in the GCP web console and click on Edit principal for the service account you want to use.


Click on Add another role, search for Dataproc Administrator to grant full access, and click on Save.
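The same role can also be granted from the command line (a sketch; PROJECT_ID and the service account e-mail are placeholders to replace with your own values):

# Grant the Dataproc Administrator role to the service account
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/dataproc.admin"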


Now we can re-execute the command successfully.
