Spark - Setting up a Dataproc Cluster on GCP
Filipe Balseiro
Data Engineer | Snowflake SnowPro Core & dbt Developer Certified | Python | GCP BigQuery | CI/CD Github Actions. Let's elevate your data strategy!
Dataproc is Google's cloud-managed service for running Spark and other data processing tools such as Flink, Presto, etc.
Dataproc can be used for data lake modernization, ETL, and secure data science, at planet scale, fully integrated with Google Cloud.
In this article I'm going to cover how to set up a Dataproc cluster on Google Cloud Platform.
Creating a Cluster
1. In the GCP web console, search for Dataproc and open it. The first time you run it you'll have to enable the Cloud Dataproc API.
2. After that, click on the Create Cluster button.
3. Define the properties needed to create the cluster, like the Name, Location and Cluster Type. It's good practice to use the same Location that was used to set up the Google Cloud Storage bucket.
4. Optionally, you may also install additional components, but we won't be covering them in this article.
5. Click on the Create button.
6. Then you have to wait a few seconds for the infrastructure to be provisioned, because GCP is creating a virtual machine in the background to support this cluster. If you go to Compute Engine -> VM Instances in the left panel, you can see that a new VM instance was created.
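If you prefer the command line, an equivalent cluster can also be created with gcloud. This is only a minimal sketch: the machine type is illustrative, while the cluster name and region match the example used later in this article.

# Single-node cluster for testing; adjust the machine type to your needs
gcloud dataproc clusters create de-zoomcamp-cluster \
    --region=us-central1 \
    --single-node \
    --master-machine-type=n2-standard-4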
Submitting a job through the web console
1. On the cluster's page, click on the Submit Job button.
2. Change the Job type to PySpark.
3. You have to enter the file path of the code that you want to execute. The easiest way is to upload the code to a GCP storage bucket, as shown in the image below (see also the gsutil example after this list).
4. To refer to a file that is stored in a GCP bucket, you have to use the following structure:
gs://{storage_bucket_name}/{path_inside_bucket}
Here is what I inserted in this case:
gs://dtc_data_lake_taxi-rides-ny-348613/code/06_spark_sql.py
5. In case you have to pass arguments to your code, you need to specify them in the Arguments field.
6. Click on the Submit button.
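For reference, this is roughly how the script can be copied into the bucket with gsutil before submitting the job; the local file name and bucket path mirror the example above.

# Upload the PySpark script to the code folder of the data lake bucket
gsutil cp 06_spark_sql.py gs://dtc_data_lake_taxi-rides-ny-348613/code/06_spark_sql.py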
Submitting a job using gcloud
Using the documentation page from Google Cloud as a source, the command has the following structure:
gcloud dataproc jobs submit job-command \
    --cluster=cluster-name \
    --region=region \
    other dataproc-flags \
    -- job-args
Adjusting this structure to my example, it becomes something like this:
gcloud dataproc jobs submit pyspark \
    --cluster=de-zoomcamp-cluster \
    --region=us-central1 \
    gs://dtc_data_lake_taxi-rides-ny-348613/code/06_spark_sql.py \
    -- \
        --input_green=gs://dtc_data_lake_taxi-rides-ny-348613/pq/green/2021/*/ \
        --input_yellow=gs://dtc_data_lake_taxi-rides-ny-348613/pq/yellow/2021/*/ \
        --output=gs://dtc_data_lake_taxi-rides-ny-348613/report-2021
I'm using another VM instance that I've created previously on my GCP account to run this code.
As you can see in the image above, when I try to run this command I get an error message. That happens because I'm using the same service account that I used for setting up other services (e.g. Terraform, Airflow, BigQuery, ...) and this account doesn't have permission to submit jobs to Dataproc.
To change that, I have to search for IAM & Admin in the GCP web console and click on Edit principal for the service account that we want to use.
Click on Add another role, search for Dataproc Administrator to grant full access, and click on Save.
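The same role can also be granted with gcloud instead of the web console. A sketch, assuming placeholder project and service account names that you would replace with your own:

# Grant the Dataproc Administrator role to the service account (project ID and account email are placeholders)
gcloud projects add-iam-policy-binding my-project-id \
    --member="serviceAccount:my-service-account@my-project-id.iam.gserviceaccount.com" \
    --role="roles/dataproc.admin"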
Now, we can re-execute the code successfully.
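Once the job has been submitted, its status can be checked from the same terminal with standard gcloud commands; JOB_ID is a placeholder for the ID printed when the job is submitted.

# List the jobs submitted to the cluster
gcloud dataproc jobs list --cluster=de-zoomcamp-cluster --region=us-central1

# Show the details and status of a specific job
gcloud dataproc jobs describe JOB_ID --region=us-central1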