Terraform Google Cloud Dataproc Project: Automated, Secure, and Scalable Infrastructure

I’ve built a comprehensive Terraform configuration that streamlines the deployment of a Google Cloud Dataproc cluster together with the supporting resources it needs: BigQuery datasets, Cloud Storage buckets, and networking components. The project is designed for high-performance data processing with enterprise-grade security and scalability.


Prerequisites

  • Terraform (v1.9.8 or higher)
  • Google Cloud SDK
  • Google Cloud Project with billing enabled
  • Proper IAM permissions for resource management


Project Structure

├── provider.tf        # Provider configuration
├── variables.tf       # Variable definitions
├── terraform.tfvars   # Variable values (set <MY-PROJECT-ID> and <MY-PROJECT-NUMBER> here)
├── iam.tf             # IAM and API configurations
├── network.tf         # VPC and networking resources
├── bucket.tf          # Cloud Storage configurations
├── bigquery.tf        # BigQuery datasets and tables
├── dataproc.tf        # Dataproc cluster configuration
└── jobs.tf            # Dataproc job definitions
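
To make the layout concrete, here is a minimal sketch of what provider.tf could contain. The variable names (var.project_id, var.region) and the provider version constraint are illustrative assumptions rather than the repository’s exact values:

# provider.tf (illustrative sketch; variable names are assumed)
terraform {
  required_version = ">= 1.9.8"
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 6.0"   # assumed provider version constraint
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}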
        

Resource Components

1. IAM and API Configuration (iam.tf)

  • Enables key APIs like dataproc.googleapis.com and compute.googleapis.com
  • Creates service accounts with IAM roles:
      • dataproc.admin
      • dataproc.worker
      • storage.objectAdmin
      • bigquery.admin
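
A minimal sketch of how iam.tf might express this; the service account name and the way the role bindings are looped over are assumptions based on the list above:

# iam.tf (illustrative sketch)
# Enable the APIs the project depends on.
resource "google_project_service" "dataproc" {
  project = var.project_id
  service = "dataproc.googleapis.com"
}

resource "google_project_service" "compute" {
  project = var.project_id
  service = "compute.googleapis.com"
}

# Service account the Dataproc cluster runs as (name is assumed).
resource "google_service_account" "dataproc_sa" {
  account_id   = "dataproc-sa"
  display_name = "Dataproc service account"
}

# Bind the roles listed above to that service account.
resource "google_project_iam_member" "dataproc_roles" {
  for_each = toset([
    "roles/dataproc.admin",
    "roles/dataproc.worker",
    "roles/storage.objectAdmin",
    "roles/bigquery.admin",
  ])

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.dataproc_sa.email}"
}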

2. Networking (network.tf)

  • Sets up a private VPC network with a custom subnet
  • Configures Cloud NAT for controlled internet access
  • Firewall rules for:
      • Internal traffic
      • SSH via Identity-Aware Proxy (IAP)
      • Egress internet access
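
A minimal sketch of those networking pieces; the resource names and the subnet range are assumptions, while 35.235.240.0/20 is the documented IAP forwarding range:

# network.tf (illustrative sketch)
resource "google_compute_network" "vpc" {
  name                    = "dataproc-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "subnet" {
  name                     = "dataproc-subnet"
  ip_cidr_range            = "10.10.0.0/24"   # assumed range
  region                   = var.region
  network                  = google_compute_network.vpc.id
  private_ip_google_access = true
}

# Cloud Router + NAT give the private nodes controlled egress to the internet.
resource "google_compute_router" "router" {
  name    = "dataproc-router"
  region  = var.region
  network = google_compute_network.vpc.id
}

resource "google_compute_router_nat" "nat" {
  name                               = "dataproc-nat"
  router                             = google_compute_router.router.name
  region                             = var.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}

# Allow SSH only from Google's IAP forwarding range.
resource "google_compute_firewall" "allow_iap_ssh" {
  name          = "allow-iap-ssh"
  network       = google_compute_network.vpc.id
  source_ranges = ["35.235.240.0/20"]

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }
}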

3. Storage (bucket.tf)

  • Creates Cloud Storage buckets:
      • Dataproc staging (dataproc-staging-${project_id})
      • ETL scripts (etl-scripts-${project_id})
  • Uploads necessary scripts:
      • bq_compare_insert.py, spark_random_numbers.py, table1_data.csv, startup_script.sh
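
A minimal sketch of the bucket definitions, assuming the names follow the patterns above and that the uploaded files live in a local scripts/ directory (an assumption):

# bucket.tf (illustrative sketch)
resource "google_storage_bucket" "dataproc_staging" {
  name                        = "dataproc-staging-${var.project_id}"
  location                    = var.region
  uniform_bucket_level_access = true
  force_destroy               = true
}

resource "google_storage_bucket" "etl_scripts" {
  name                        = "etl-scripts-${var.project_id}"
  location                    = var.region
  uniform_bucket_level_access = true
  force_destroy               = true
}

# Upload one of the job scripts; the other files follow the same pattern.
resource "google_storage_bucket_object" "bq_compare_insert" {
  name   = "bq_compare_insert.py"
  bucket = google_storage_bucket.etl_scripts.name
  source = "scripts/bq_compare_insert.py"   # assumed local path
}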

4. BigQuery (bigquery.tf)

  • Automates dataset creation (example_dataset)
  • Configures table schemas (id, name, age) and data loading jobs
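
A minimal sketch of what bigquery.tf might contain, given the dataset name, the (id, name, age) schema, and the load-table1-job1 job referenced in the clean-up steps below; the table name and CSV location are assumptions:

# bigquery.tf (illustrative sketch)
resource "google_bigquery_dataset" "example_dataset" {
  dataset_id = "example_dataset"
  location   = var.region
}

resource "google_bigquery_table" "table1" {
  dataset_id          = google_bigquery_dataset.example_dataset.dataset_id
  table_id            = "table1"   # assumed table name
  deletion_protection = false

  schema = jsonencode([
    { name = "id",   type = "INTEGER", mode = "REQUIRED" },
    { name = "name", type = "STRING",  mode = "NULLABLE" },
    { name = "age",  type = "INTEGER", mode = "NULLABLE" },
  ])
}

# Load job that imports table1_data.csv from the ETL bucket.
resource "google_bigquery_job" "load_table1" {
  job_id   = "load-table1-job1"
  location = var.region

  load {
    destination_table {
      project_id = var.project_id
      dataset_id = google_bigquery_dataset.example_dataset.dataset_id
      table_id   = google_bigquery_table.table1.table_id
    }

    source_uris       = ["gs://etl-scripts-${var.project_id}/table1_data.csv"]
    skip_leading_rows = 1   # assumes the CSV has a header row
  }
}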

5. Dataproc Cluster Configuration (dataproc.tf)

  • Cluster details:
      • Master node: 1x n2-standard-4, 1024 GB storage
      • Worker nodes: 2x n2-standard-4, 1024 GB storage each
  • Uses the Dataproc 2.2-debian12 image with the JUPYTER optional component and a custom initialization script
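
A minimal sketch of a cluster resource matching those details; the cluster name and the references to the service account, subnet, and buckets defined in the other files are assumptions:

# dataproc.tf (illustrative sketch)
resource "google_dataproc_cluster" "cluster" {
  name   = "etl-cluster"   # assumed name
  region = var.region

  cluster_config {
    staging_bucket = google_storage_bucket.dataproc_staging.name

    master_config {
      num_instances = 1
      machine_type  = "n2-standard-4"
      disk_config {
        boot_disk_size_gb = 1024
      }
    }

    worker_config {
      num_instances = 2
      machine_type  = "n2-standard-4"
      disk_config {
        boot_disk_size_gb = 1024
      }
    }

    software_config {
      image_version       = "2.2-debian12"
      optional_components = ["JUPYTER"]
    }

    gce_cluster_config {
      subnetwork       = google_compute_subnetwork.subnet.id
      internal_ip_only = true
      service_account  = google_service_account.dataproc_sa.email
    }

    initialization_action {
      script = "gs://${google_storage_bucket.etl_scripts.name}/startup_script.sh"
    }
  }
}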

6. Dataproc Jobs (jobs.tf)

  • Automated setup for multiple processing jobs:
      • PySpark: BigQuery comparison, random number generation
      • Spark: SparkPi computation
      • Hadoop: WordCount
      • Hive and Pig: table operations and word count
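
A minimal sketch of two of those job definitions; the BigQuery connector jar and the SparkPi arguments are illustrative choices, not necessarily what the repository uses:

# jobs.tf (illustrative sketch)
# PySpark job running the BigQuery comparison script from the ETL bucket.
resource "google_dataproc_job" "bq_compare" {
  region       = var.region
  force_delete = true

  placement {
    cluster_name = google_dataproc_cluster.cluster.name
  }

  pyspark_config {
    main_python_file_uri = "gs://${google_storage_bucket.etl_scripts.name}/bq_compare_insert.py"
    jar_file_uris        = ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"]
  }
}

# Spark job running the built-in SparkPi example.
resource "google_dataproc_job" "spark_pi" {
  region       = var.region
  force_delete = true

  placement {
    cluster_name = google_dataproc_cluster.cluster.name
  }

  spark_config {
    main_class    = "org.apache.spark.examples.SparkPi"
    jar_file_uris = ["file:///usr/lib/spark/examples/jars/spark-examples.jar"]
    args          = ["1000"]
  }
}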


Usage

Clone the repository:

git clone <repository-url>
cd <repository-name>
        

Initialize Terraform:

terraform init
        

Apply configuration:

terraform apply
        

Clean up: Before destroying resources, remove any BigQuery jobs manually, as Terraform doesn’t automatically delete them.

a. Install BigQuery command-line tool:

gcloud components install bq        

b. Check BigQuery job details:

bq show --project_id=<MY-PROJECT-ID> --location=<MY-LOCATION> -j load-table1-job1        

c. Remove BigQuery job:

bq rm -j --location=<MY-LOCATION> --project_id=<MY-PROJECT-ID> load-table1-job1        

Destroy the infrastructure:

terraform destroy         

Important Notes

  • Private VPC with internal IPs only for enhanced security
  • Uniform bucket-level access enforced on storage buckets
  • Custom initialization scripts for Dataproc jobs


Security Considerations

  • Resources deployed in a private VPC with Cloud NAT and IAP SSH access
  • Firewall rules follow the principle of least privilege


Troubleshooting

  • BigQuery Job Deletion: Use the bq command-line tool
  • Networking: Verify Cloud NAT configuration and firewall rules
  • Permission Issues: Ensure IAM roles are properly assigned


For a detailed look, check out the full code on GitHub: GitHub Repository
