Terraform Google Cloud Dataproc Project: Automated, Secure, and Scalable Infrastructure

I’ve built a comprehensive Terraform configuration that streamlines the deployment of a Google Cloud Dataproc cluster together with the supporting resources it needs: BigQuery datasets, Cloud Storage buckets, and networking components. The project is designed for high-performance data processing with enterprise-grade security and scalability.


Prerequisites

  • Terraform (v1.9.8 or higher)
  • Google Cloud SDK
  • Google Cloud Project with billing enabled
  • Proper IAM permissions for resource management


Project Structure

├── provider.tf        # Provider configuration
├── variables.tf       # Variable definitions
├── terraform.tfvars   # Variable values (set <MY-PROJECT-ID> and <MY-PROJECT-NUMBER> here)
├── iam.tf             # IAM and API configurations
├── network.tf         # VPC and networking resources
├── bucket.tf          # Cloud Storage configurations
├── bigquery.tf        # BigQuery datasets and tables
├── dataproc.tf        # Dataproc cluster configuration
└── jobs.tf            # Dataproc job definitions
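
To make the layout concrete, here is a minimal sketch of what provider.tf could contain. The variable names (var.project_id, var.region) and the provider version constraint are illustrative assumptions rather than the repository’s exact values:

# provider.tf (illustrative sketch; variable names are assumed)
terraform {
  required_version = ">= 1.9.8"
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 6.0"   # assumed provider version constraint
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}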
        

Resource Components

1. IAM and API Configuration (iam.tf)

  • Enables key APIs like dataproc.googleapis.com and compute.googleapis.com
  • Creates service accounts with IAM roles:
      • dataproc.admin
      • dataproc.worker
      • storage.objectAdmin
      • bigquery.admin
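
A minimal sketch of how iam.tf might express this; the service account name and the way the role bindings are looped over are assumptions based on the list above:

# iam.tf (illustrative sketch)
# Enable the APIs the project depends on.
resource "google_project_service" "dataproc" {
  project = var.project_id
  service = "dataproc.googleapis.com"
}

resource "google_project_service" "compute" {
  project = var.project_id
  service = "compute.googleapis.com"
}

# Service account the Dataproc cluster runs as (name is assumed).
resource "google_service_account" "dataproc_sa" {
  account_id   = "dataproc-sa"
  display_name = "Dataproc service account"
}

# Bind the roles listed above to that service account.
resource "google_project_iam_member" "dataproc_roles" {
  for_each = toset([
    "roles/dataproc.admin",
    "roles/dataproc.worker",
    "roles/storage.objectAdmin",
    "roles/bigquery.admin",
  ])

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.dataproc_sa.email}"
}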

2. Networking (network.tf)

  • Sets up a private VPC network with a custom subnet
  • Configures Cloud NAT for controlled internet access
  • Firewall rules for:
      • Internal traffic
      • SSH via Identity-Aware Proxy (IAP)
      • Egress internet access
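
A minimal sketch of those networking pieces; the resource names and the subnet range are assumptions, while 35.235.240.0/20 is the documented IAP forwarding range:

# network.tf (illustrative sketch)
resource "google_compute_network" "vpc" {
  name                    = "dataproc-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "subnet" {
  name                     = "dataproc-subnet"
  ip_cidr_range            = "10.10.0.0/24"   # assumed range
  region                   = var.region
  network                  = google_compute_network.vpc.id
  private_ip_google_access = true
}

# Cloud Router + NAT give the private nodes controlled egress to the internet.
resource "google_compute_router" "router" {
  name    = "dataproc-router"
  region  = var.region
  network = google_compute_network.vpc.id
}

resource "google_compute_router_nat" "nat" {
  name                               = "dataproc-nat"
  router                             = google_compute_router.router.name
  region                             = var.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}

# Allow SSH only from Google's IAP forwarding range.
resource "google_compute_firewall" "allow_iap_ssh" {
  name          = "allow-iap-ssh"
  network       = google_compute_network.vpc.id
  source_ranges = ["35.235.240.0/20"]

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }
}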

3. Storage (bucket.tf)

  • Creates Cloud Storage buckets:
      • Dataproc staging (dataproc-staging-${project_id})
      • ETL scripts (etl-scripts-${project_id})
  • Uploads necessary scripts:
      • bq_compare_insert.py, spark_random_numbers.py, table1_data.csv, startup_script.sh
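
A minimal sketch of the bucket definitions, assuming the names follow the patterns above and that the uploaded files live in a local scripts/ directory (an assumption):

# bucket.tf (illustrative sketch)
resource "google_storage_bucket" "dataproc_staging" {
  name                        = "dataproc-staging-${var.project_id}"
  location                    = var.region
  uniform_bucket_level_access = true
  force_destroy               = true
}

resource "google_storage_bucket" "etl_scripts" {
  name                        = "etl-scripts-${var.project_id}"
  location                    = var.region
  uniform_bucket_level_access = true
  force_destroy               = true
}

# Upload one of the job scripts; the other files follow the same pattern.
resource "google_storage_bucket_object" "bq_compare_insert" {
  name   = "bq_compare_insert.py"
  bucket = google_storage_bucket.etl_scripts.name
  source = "scripts/bq_compare_insert.py"   # assumed local path
}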

4. BigQuery (bigquery.tf)

  • Automates dataset creation (example_dataset)
  • Configures table schemas (id, name, age) and data loading jobs
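
A minimal sketch of what bigquery.tf might contain, given the dataset name, the (id, name, age) schema, and the load-table1-job1 job referenced in the clean-up steps below; the table name and CSV location are assumptions:

# bigquery.tf (illustrative sketch)
resource "google_bigquery_dataset" "example_dataset" {
  dataset_id = "example_dataset"
  location   = var.region
}

resource "google_bigquery_table" "table1" {
  dataset_id          = google_bigquery_dataset.example_dataset.dataset_id
  table_id            = "table1"   # assumed table name
  deletion_protection = false

  schema = jsonencode([
    { name = "id",   type = "INTEGER", mode = "REQUIRED" },
    { name = "name", type = "STRING",  mode = "NULLABLE" },
    { name = "age",  type = "INTEGER", mode = "NULLABLE" },
  ])
}

# Load job that imports table1_data.csv from the ETL bucket.
resource "google_bigquery_job" "load_table1" {
  job_id   = "load-table1-job1"
  location = var.region

  load {
    destination_table {
      project_id = var.project_id
      dataset_id = google_bigquery_dataset.example_dataset.dataset_id
      table_id   = google_bigquery_table.table1.table_id
    }

    source_uris       = ["gs://etl-scripts-${var.project_id}/table1_data.csv"]
    skip_leading_rows = 1   # assumes the CSV has a header row
  }
}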

5. Dataproc Cluster Configuration (dataproc.tf)

  • Cluster details:
      • Master node: 1x n2-standard-4, 1024 GB storage
      • Worker nodes: 2x n2-standard-4, 1024 GB storage each
  • Uses the Dataproc 2.2-debian12 image with the JUPYTER optional component and a custom initialization script
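
A minimal sketch of a cluster resource matching those details; the cluster name and the references to the service account, subnet, and buckets defined in the other files are assumptions:

# dataproc.tf (illustrative sketch)
resource "google_dataproc_cluster" "cluster" {
  name   = "etl-cluster"   # assumed name
  region = var.region

  cluster_config {
    staging_bucket = google_storage_bucket.dataproc_staging.name

    master_config {
      num_instances = 1
      machine_type  = "n2-standard-4"
      disk_config {
        boot_disk_size_gb = 1024
      }
    }

    worker_config {
      num_instances = 2
      machine_type  = "n2-standard-4"
      disk_config {
        boot_disk_size_gb = 1024
      }
    }

    software_config {
      image_version       = "2.2-debian12"
      optional_components = ["JUPYTER"]
    }

    gce_cluster_config {
      subnetwork       = google_compute_subnetwork.subnet.id
      internal_ip_only = true
      service_account  = google_service_account.dataproc_sa.email
    }

    initialization_action {
      script = "gs://${google_storage_bucket.etl_scripts.name}/startup_script.sh"
    }
  }
}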

6. Dataproc Jobs (jobs.tf)

  • Automated setup for multiple processing jobs:
      • PySpark: BigQuery comparison, random number generation
      • Spark: SparkPi computation
      • Hadoop: WordCount
      • Hive and Pig: table operations and word count
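
A minimal sketch of two of those job definitions; the BigQuery connector jar and the SparkPi arguments are illustrative choices, not necessarily what the repository uses:

# jobs.tf (illustrative sketch)
# PySpark job running the BigQuery comparison script from the ETL bucket.
resource "google_dataproc_job" "bq_compare" {
  region       = var.region
  force_delete = true

  placement {
    cluster_name = google_dataproc_cluster.cluster.name
  }

  pyspark_config {
    main_python_file_uri = "gs://${google_storage_bucket.etl_scripts.name}/bq_compare_insert.py"
    jar_file_uris        = ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"]
  }
}

# Spark job running the built-in SparkPi example.
resource "google_dataproc_job" "spark_pi" {
  region       = var.region
  force_delete = true

  placement {
    cluster_name = google_dataproc_cluster.cluster.name
  }

  spark_config {
    main_class    = "org.apache.spark.examples.SparkPi"
    jar_file_uris = ["file:///usr/lib/spark/examples/jars/spark-examples.jar"]
    args          = ["1000"]
  }
}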


Usage

Clone the repository:

git clone <repository-url>
cd <repository-name>
        

Initialize Terraform:

terraform init
        

Apply configuration:

terraform apply
        

Clean up: Before destroying resources, remove any BigQuery jobs manually, as Terraform doesn’t automatically delete them.

a. Install BigQuery command-line tool:

gcloud components install bq        

b. Check BigQuery job details:

bq show --project_id=<MY-PROJECT-ID> --location=<MY-LOCATION> -j load-table1-job1        

c. Remove BigQuery job:

bq rm -j --location=<MY-LOCATION> --project_id=<MY-PROJECT-ID> load-table1-job1        

Destroy the infrastructure:

terraform destroy         

Important Notes

  • Private VPC with internal IPs only for enhanced security
  • Uniform bucket-level access enforced on storage buckets
  • Custom initialization scripts for Dataproc jobs


Security Considerations

  • Resources deployed in a private VPC with Cloud NAT and IAP SSH access
  • Firewall rules follow the principle of least privilege


Troubleshooting

  • BigQuery Job Deletion: Use the bq command-line tool
  • Networking: Verify Cloud NAT configuration and firewall rules
  • Permission Issues: Ensure IAM roles are properly assigned


For a detailed look, check out the full code on GitHub: GitHub Repository
