AWS Glue Data Catalog as the Metastore for Databricks


We can configure Databricks Runtime to use the AWS Glue Data Catalog as its metastore. This can serve as a drop-in replacement for a Hive metastore.

Each AWS account owns a single catalog in an AWS region whose catalog ID is the same as the AWS account ID. Using the Glue Catalog as the metastore for Databricks can potentially enable a shared metastore across AWS services, applications, or AWS accounts.

We can configure multiple Databricks workspaces to share the same metastore.

This article shows how to securely access a Glue Data Catalog in Databricks using instance profiles.

Let's Create a Database in AWS Glue

Step1: Log in to the AWS account & navigate to the AWS Glue service


Step2: Click on Databases & add a database (I named it - myfirstdatabase)


Our database is created.
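
The console clicks above can also be scripted. A minimal boto3 sketch of the same step (the region is an assumption):

import boto3

# Create the Glue database programmatically (equivalent to the console step above)
glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption
glue.create_database(DatabaseInput={"Name": "myfirstdatabase"})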

Now the idea is to populate & access this database from Databricks. Exciting? Let's work on it.

Step1: Create an instance profile to access a Glue Data Catalog



  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantCatalogAccessToGlue",
      "Effect": "Allow",
      "Action": [
        "glue:BatchCreatePartition",
        "glue:BatchDeletePartition",
        "glue:BatchGetPartition",
        "glue:CreateDatabase",
        "glue:CreateTable",
        "glue:CreateUserDefinedFunction",
        "glue:DeleteDatabase",
        "glue:DeletePartition",
        "glue:DeleteTable",
        "glue:DeleteUserDefinedFunction",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:UpdateDatabase",
        "glue:UpdatePartition",
        "glue:UpdateTable",
        "glue:UpdateUserDefinedFunction"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
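
If you prefer scripting this step over clicking through the IAM console, here is a hedged boto3 sketch; the role & profile names and the EC2 trust policy are assumptions, and the policy attached is the JSON document above:

import json
import boto3

iam = boto3.client("iam")

# Standard trust policy letting EC2 instances assume the role
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

# The Glue access policy shown above, saved locally as glue_policy.json
with open("glue_policy.json") as f:
    glue_policy = f.read()

iam.create_role(RoleName="glue-catalog-role",
                AssumeRolePolicyDocument=json.dumps(trust))
iam.put_role_policy(RoleName="glue-catalog-role",
                    PolicyName="GrantCatalogAccessToGlue",
                    PolicyDocument=glue_policy)

# Wrap the role in an instance profile that Databricks clusters can use
iam.create_instance_profile(InstanceProfileName="glue-catalog-profile")
iam.add_role_to_instance_profile(InstanceProfileName="glue-catalog-profile",
                                 RoleName="glue-catalog-role")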


Step3: Look up the IAM role used to create the Databricks deployment (Need to be done in the Databricks Account)


Step4: Add the Glue Catalog instance profile to the EC2 policy (Need to be done in the AWS Account)



  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1403287045000",
      "Effect": "Allow",
      "Action": [
        "ec2:AssociateDhcpOptions",
        "ec2:AssociateIamInstanceProfile",
        "ec2:AssociateRouteTable",
        "ec2:AttachInternetGateway",
        "ec2:AttachVolume",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CancelSpotInstanceRequests",
        "ec2:CreateDhcpOptions",
        "ec2:CreateInternetGateway",
        "ec2:CreatePlacementGroup",
        "ec2:CreateRoute",
        "ec2:CreateSecurityGroup",
        "ec2:CreateSubnet",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateVpc",
        "ec2:CreateVpcPeeringConnection",
        "ec2:DeleteInternetGateway",
        "ec2:DeletePlacementGroup",
        "ec2:DeleteRoute",
        "ec2:DeleteRouteTable",
        "ec2:DeleteSecurityGroup",
        "ec2:DeleteSubnet",
        "ec2:DeleteTags",
        "ec2:DeleteVolume",
        "ec2:DeleteVpc",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeIamInstanceProfileAssociations",
        "ec2:DescribeInstanceStatus",
        "ec2:DescribeInstances",
        "ec2:DescribePlacementGroups",
        "ec2:DescribePrefixLists",
        "ec2:DescribeReservedInstancesOfferings",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSpotInstanceRequests",
        "ec2:DescribeSpotPriceHistory",
        "ec2:DescribeSubnets",
        "ec2:DescribeVolumes",
        "ec2:DescribeVpcs",
        "ec2:DetachInternetGateway",
        "ec2:DisassociateIamInstanceProfile",
        "ec2:ModifyVpcAttribute",
        "ec2:ReplaceIamInstanceProfileAssociation",
        "ec2:RequestSpotInstances",
        "ec2:RevokeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RunInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-glue-access>"
    }
  ]
}
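
The key addition here is the iam:PassRole statement at the bottom of the policy - it allows the Databricks deployment role to pass the Glue access role to the EC2 instances it launches. A hedged boto3 sketch of applying the updated policy (the role & policy names are assumptions; ec2_policy.json is the JSON document above):

import boto3

iam = boto3.client("iam")

# The updated EC2 policy shown above, saved locally as ec2_policy.json
with open("ec2_policy.json") as f:
    ec2_policy = f.read()

# Overwrite the deployment role's inline policy with the version that
# includes the iam:PassRole statement
iam.put_role_policy(RoleName="databricks-deployment-role",  # assumption
                    PolicyName="databricks-ec2-policy",     # assumption
                    PolicyDocument=ec2_policy)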


Step5: Add the Glue Catalog instance profile to a Databricks workspace (Need to be done in the Databricks Workspace)


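This is done through the Admin Console's Instance Profiles tab, but the same registration can be scripted against the workspace's Instance Profiles API. A minimal sketch, assuming a workspace URL, a personal access token, & the profile ARN from Step1:

import requests

# Register the instance profile with the workspace (all bracketed values are placeholders)
resp = requests.post(
    "https://<workspace-url>/api/2.0/instance-profiles/add",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"instance_profile_arn":
          "arn:aws:iam::<aws-account-id>:instance-profile/glue-catalog-profile"})
resp.raise_for_status()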


Step6: Launch a cluster with the Glue Catalog instance profile

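In the UI this is the Instance Profile dropdown in the cluster configuration. The same cluster can also be created via the Clusters API; a hedged sketch (runtime version, node type, & names are assumptions):

import requests

# Create a cluster that runs with the Glue Catalog instance profile and the
# Glue metastore switch already set (all bracketed values are placeholders)
resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "cluster_name": "glue-catalog-cluster",
        "spark_version": "<runtime-version>",
        "node_type_id": "<node-type>",
        "num_workers": 1,
        "aws_attributes": {
            "instance_profile_arn":
                "arn:aws:iam::<aws-account-id>:instance-profile/glue-catalog-profile"},
        "spark_conf": {
            "spark.databricks.hive.metastore.glueCatalog.enabled": "true"}})
resp.raise_for_status()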

Step7: Configure Glue Data Catalog as the metastore

To enable the Glue Catalog integration, set spark.databricks.hive.metastore.glueCatalog.enabled to true. This configuration is disabled by default; that is, the default is the Databricks-hosted Hive metastore, or some other external metastore if configured.

For interactive or jobs clusters, set the configuration in the cluster configuration before cluster startup.
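
In the cluster's Spark config box, the entry is a single key-value pair per line:

spark.databricks.hive.metastore.glueCatalog.enabled true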


The cluster is now up & running.


Now let's check whether we can access the Glue Catalog using the command below.

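The exact command is in the screenshot; a typical check from a Python notebook cell looks like this:

# List the databases visible through the Glue Catalog metastore
spark.sql("SHOW DATABASES").show()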

Let's check in the Glue Console whether we have the same databases. (We know we do, because we created "myfirstdatabase" in our first step.)


Wow!! This is what we were waiting for. The setup is done. We have configured the external catalog for Databricks. Amazing, isn't it?

Now let's create a database & table from Databricks & check whether they are reflected in the Glue Catalog.

All the steps below need to be done in a Databricks notebook.

Step1: We'll create a database called glue_db & register it to an S3 bucket (I have created a bucket - s3bucket1905).

Set up the AWS keys (ideally we should do this via an instance profile or a secret scope, but for now I am directly providing the AWS Access Key & AWS Secret Key).

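A minimal sketch of those two cells, assuming a Python notebook (the keys are placeholders; as noted above, an instance profile or secret scope is the better practice):

# Hand the AWS credentials to the S3A filesystem (demo only - do not
# hard-code keys in real notebooks)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<aws-access-key>")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<aws-secret-key>")

# Create the database in the Glue Catalog, backed by our S3 bucket
spark.sql("CREATE DATABASE IF NOT EXISTS glue_db LOCATION 's3a://s3bucket1905/glue_db/'")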


Step2: Read From File & Write into a Table

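The cell is in the screenshot; a sketch of the idea (the source path, format, & options are assumptions):

# Read a source file into a DataFrame (path & format here are hypothetical)
df = spark.read.option("header", "true").csv("s3a://s3bucket1905/input/pages.csv")
df.show(5)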

Step3: When using "saveAsTable" with Glue, we'll write our data as Parquet to the S3 bucket specified when we registered our database.

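A one-line sketch of that write; the table name "page" comes from the next step, & the glue_db prefix routes the files to the database's S3 location:

# Writes Parquet files under the database location (s3a://s3bucket1905/glue_db/)
# & registers the table "page" in the Glue Catalog
df.write.format("parquet").saveAsTable("glue_db.page")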


Step4: Now let's check the S3 bucket for the Parquet files backing our table "page".

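This can also be checked from the notebook, assuming the files follow the database-location convention above:

# List the Parquet files backing the table (the exact path is an assumption)
display(dbutils.fs.ls("s3a://s3bucket1905/glue_db/page/"))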

Step5: One final step - check in the Databricks Data tab whether the table is reflected, & in AWS Glue whether the newly created database shows up alongside the previous 2 databases.

In Databricks Data Tab:



In AWS Glue:



Wow!! That was quite some work - lots of settings needed to be configured in AWS & Databricks to make it all come together. From here on we can easily query tables created by Databricks from other AWS services such as AWS Athena, without even involving a Databricks cluster.

That's it. We are done.

This marks the end of this article. I hope I was able to give you something new to learn. Thanks for reading; please provide your feedback in the comment section, & please like & share if you liked the content.

Thanks !! Happy Monday, Happy Learning !!

Abhishek Trivedi

Streaming Data Engineer @ Disney| 18K | ex-Hopscotch | DATABRICKS | SCALA | SPARK | AWS Cloud | STREAMING | AIRFLOW | KINESIS

2y

Very detailed and informative. Bravo!
Hesham Bataineh

Machine Learning | Data engineering & Software Development

2y

Hello, I'm trying to configure the Glue Catalog with Databricks, following the same steps indicated in your article. Something odd happens when I paste the created instance profile into Databricks - it fails at validation: "Verification of the instance profile failed. AWS error: You are not authorized to perform this operation." I've followed the steps using my AWS root account; any solution?
Ashok G.

Azure Data Architect | 2k+ Subscriber Youtuber | learning Enthusiasts

2y

Hi Deepak, do we have any similar feature for Azure as well?
Sunil Kumar Chinni

Data Architect/Engineer| Batch & Streaming | AWS, Kafka, Python, Kubernetes, Docker, Spark, Data Modelling, Cloud Database(Redshift, Snowflake), ETL Tools

3y

Thanks for the article Deepak Rajak. Have you seen any performance issues while querying? If yes, any resolution…
Aruna Bhamidipati

Data Quality Specialist at HSBC

3y

Thank you Deepak. Was trying with the Databricks documentation, but could not get this working. Your post solved the issue. Good tutorial.
