Spark - Setting up a Dataproc Cluster on GCP
Filipe Balseiro
Data Engineer | Snowflake SnowPro Core & dbt Developer Certified | Python | GCP BigQuery | CI/CD Github Actions. Let's elevate your data strategy!
Dataproc is Google's cloud-managed service for running Spark and other data processing tools such as Flink, Presto, etc.
Dataproc can be used for data lake modernization, ETL, and secure data science, at planet scale, fully integrated with Google Cloud.
In this article I'm going to cover how to set up a Dataproc cluster on Google Cloud Platform.
Creating a Cluster
1. In the GCP web console, search for Dataproc and open it. The first time you run it you'll have to enable the Cloud Dataproc API.
2. After that, click on the Create Cluster button.
3. Define the properties needed to create the cluster, like the Name, Location and Cluster Type. It's good practice to use the same Location that was used to set up the Google Cloud Storage bucket.
4. Optionally, you may also install additional components, but we won't be covering them in this article.
5. Click on the Create button.
6. Then you have to wait a few seconds for the infrastructure to be provisioned, because GCP is creating a virtual machine in the background to support this cluster. If you go to Compute Engine -> VM Instances in the left panel, you can see that a new VM instance was created.
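If you prefer the command line, an equivalent cluster can also be created with gcloud. This is only a minimal sketch: the machine type is illustrative, while the cluster name and region match the example used later in this article.

# Single-node cluster for testing; adjust the machine type to your needs
gcloud dataproc clusters create de-zoomcamp-cluster \
    --region=us-central1 \
    --single-node \
    --master-machine-type=n2-standard-4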
Submitting a job through the web console
1. On the cluster's page, click on the Submit Job button.
2. Change the Job type to PySpark.
3. You have to enter the file path of the code that you want to execute. The easiest way is to upload the code to a GCP storage bucket, as shown in the image below (see also the gsutil example after this list).
4. To refer to a file that is stored in a GCP bucket, you have to use the following structure:
gs://{storage_bucket_name}/{path_inside_bucket}
Here is what I inserted in this case:
gs://dtc_data_lake_taxi-rides-ny-348613/code/06_spark_sql.py
5. In case you have to pass arguments to your code, you need to specify them in the Arguments field.
6. Click on the Submit button.
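For reference, this is roughly how the script can be copied into the bucket with gsutil before submitting the job; the local file name and bucket path mirror the example above.

# Upload the PySpark script to the code folder of the data lake bucket
gsutil cp 06_spark_sql.py gs://dtc_data_lake_taxi-rides-ny-348613/code/06_spark_sql.py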
Submitting a job using gcloud
Using the documentation page from Google Cloud as a source, the command has the following structure:
gcloud dataproc jobs submit job-command \
    --cluster=cluster-name \
    --region=region \
    other dataproc-flags \
    -- job-args
Adjusting this structure to my example, it becomes something like this:
gcloud dataproc jobs submit pyspark \
    --cluster=de-zoomcamp-cluster \
    --region=us-central1 \
    gs://dtc_data_lake_taxi-rides-ny-348613/code/06_spark_sql.py \
    -- \
        --input_green=gs://dtc_data_lake_taxi-rides-ny-348613/pq/green/2021/*/ \
        --input_yellow=gs://dtc_data_lake_taxi-rides-ny-348613/pq/yellow/2021/*/ \
        --output=gs://dtc_data_lake_taxi-rides-ny-348613/report-2021
I'm using another VM instance that I've created previously on my GCP account to run this code.
As you can see in the image above, when I try to run this command I get an error message. That happens because I'm using the same service account that I used for setting up other services (e.g. Terraform, Airflow, BigQuery, ...) and this account doesn't have permission to submit jobs to Dataproc.
To change that, I have to search for IAM & Admin in the GCP web console and click on Edit principal for the service account that we want to use.
Click on Add another role, search for Dataproc Administrator to grant full access, and click on Save.
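The same role can also be granted with gcloud instead of the web console. A sketch, assuming placeholder project and service account names that you would replace with your own:

# Grant the Dataproc Administrator role to the service account (project ID and account email are placeholders)
gcloud projects add-iam-policy-binding my-project-id \
    --member="serviceAccount:my-service-account@my-project-id.iam.gserviceaccount.com" \
    --role="roles/dataproc.admin"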
Now, we can re-execute the code successfully.
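Once the job has been submitted, its status can be checked from the same terminal with standard gcloud commands; JOB_ID is a placeholder for the ID printed when the job is submitted.

# List the jobs submitted to the cluster
gcloud dataproc jobs list --cluster=de-zoomcamp-cluster --region=us-central1

# Show the details and status of a specific job
gcloud dataproc jobs describe JOB_ID --region=us-central1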