AWS Glue Data Catalog as the Metastore for Databricks


We can configure Databricks Runtime to use the AWS Glue Data Catalog as its metastore. This can serve as a drop-in replacement for a Hive metastore.

Each AWS account owns a single catalog in an AWS region whose catalog ID is the same as the AWS account ID. Using the Glue Catalog as the metastore for Databricks can potentially enable a shared metastore across AWS services, applications, or AWS accounts.

We can configure multiple Databricks workspaces to share the same metastore.

This article shows how to securely access a Glue Data Catalog in Databricks using instance profiles.

Let's Create a Database in AWS Glue

Step1: Log in to the AWS account & navigate to the AWS Glue service


Step2: Click on Databases & add a database (I named it - myfirstdatabase)


Our database is created.
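
The console clicks above can also be scripted. A minimal boto3 sketch of the same step (the region is an assumption):

import boto3

# Create the Glue database programmatically (equivalent to the console step above)
glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption
glue.create_database(DatabaseInput={"Name": "myfirstdatabase"})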

Now the idea is to populate & access this database from Databricks. Exciting? Let's work on it.

Step1: Create an instance profile to access a Glue Data Catalog



  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantCatalogAccessToGlue",
      "Effect": "Allow",
      "Action": [
        "glue:BatchCreatePartition",
        "glue:BatchDeletePartition",
        "glue:BatchGetPartition",
        "glue:CreateDatabase",
        "glue:CreateTable",
        "glue:CreateUserDefinedFunction",
        "glue:DeleteDatabase",
        "glue:DeletePartition",
        "glue:DeleteTable",
        "glue:DeleteUserDefinedFunction",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:UpdateDatabase",
        "glue:UpdatePartition",
        "glue:UpdateTable",
        "glue:UpdateUserDefinedFunction"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
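
If you prefer scripting this step over clicking through the IAM console, here is a hedged boto3 sketch; the role & profile names and the EC2 trust policy are assumptions, and the policy attached is the JSON document above:

import json
import boto3

iam = boto3.client("iam")

# Standard trust policy letting EC2 instances assume the role
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

# The Glue access policy shown above, saved locally as glue_policy.json
with open("glue_policy.json") as f:
    glue_policy = f.read()

iam.create_role(RoleName="glue-catalog-role",
                AssumeRolePolicyDocument=json.dumps(trust))
iam.put_role_policy(RoleName="glue-catalog-role",
                    PolicyName="GrantCatalogAccessToGlue",
                    PolicyDocument=glue_policy)

# Wrap the role in an instance profile that Databricks clusters can use
iam.create_instance_profile(InstanceProfileName="glue-catalog-profile")
iam.add_role_to_instance_profile(InstanceProfileName="glue-catalog-profile",
                                 RoleName="glue-catalog-role")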


Step3: Look up the IAM role used to create the Databricks deployment (Need to be done in the Databricks Account)


Step4: Add the Glue Catalog instance profile to the EC2 policy (Need to be done in the AWS Account)



  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1403287045000",
      "Effect": "Allow",
      "Action": [
        "ec2:AssociateDhcpOptions",
        "ec2:AssociateIamInstanceProfile",
        "ec2:AssociateRouteTable",
        "ec2:AttachInternetGateway",
        "ec2:AttachVolume",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CancelSpotInstanceRequests",
        "ec2:CreateDhcpOptions",
        "ec2:CreateInternetGateway",
        "ec2:CreatePlacementGroup",
        "ec2:CreateRoute",
        "ec2:CreateSecurityGroup",
        "ec2:CreateSubnet",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateVpc",
        "ec2:CreateVpcPeeringConnection",
        "ec2:DeleteInternetGateway",
        "ec2:DeletePlacementGroup",
        "ec2:DeleteRoute",
        "ec2:DeleteRouteTable",
        "ec2:DeleteSecurityGroup",
        "ec2:DeleteSubnet",
        "ec2:DeleteTags",
        "ec2:DeleteVolume",
        "ec2:DeleteVpc",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeIamInstanceProfileAssociations",
        "ec2:DescribeInstanceStatus",
        "ec2:DescribeInstances",
        "ec2:DescribePlacementGroups",
        "ec2:DescribePrefixLists",
        "ec2:DescribeReservedInstancesOfferings",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSpotInstanceRequests",
        "ec2:DescribeSpotPriceHistory",
        "ec2:DescribeSubnets",
        "ec2:DescribeVolumes",
        "ec2:DescribeVpcs",
        "ec2:DetachInternetGateway",
        "ec2:DisassociateIamInstanceProfile",
        "ec2:ModifyVpcAttribute",
        "ec2:ReplaceIamInstanceProfileAssociation",
        "ec2:RequestSpotInstances",
        "ec2:RevokeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RunInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-glue-access>"
    }
  ]
}
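
The key addition here is the iam:PassRole statement at the bottom of the policy - it allows the Databricks deployment role to pass the Glue access role to the EC2 instances it launches. A hedged boto3 sketch of applying the updated policy (the role & policy names are assumptions; ec2_policy.json is the JSON document above):

import boto3

iam = boto3.client("iam")

# The updated EC2 policy shown above, saved locally as ec2_policy.json
with open("ec2_policy.json") as f:
    ec2_policy = f.read()

# Overwrite the deployment role's inline policy with the version that
# includes the iam:PassRole statement
iam.put_role_policy(RoleName="databricks-deployment-role",  # assumption
                    PolicyName="databricks-ec2-policy",     # assumption
                    PolicyDocument=ec2_policy)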


Step5: Add the Glue Catalog instance profile to a Databricks workspace (Need to be done in the Databricks Workspace)


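This is done through the Admin Console's Instance Profiles tab, but the same registration can be scripted against the workspace's Instance Profiles API. A minimal sketch, assuming a workspace URL, a personal access token, & the profile ARN from Step1:

import requests

# Register the instance profile with the workspace (all bracketed values are placeholders)
resp = requests.post(
    "https://<workspace-url>/api/2.0/instance-profiles/add",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"instance_profile_arn":
          "arn:aws:iam::<aws-account-id>:instance-profile/glue-catalog-profile"})
resp.raise_for_status()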


Step6: Launch a cluster with the Glue Catalog instance profile

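In the UI this is the Instance Profile dropdown in the cluster configuration. The same cluster can also be created via the Clusters API; a hedged sketch (runtime version, node type, & names are assumptions):

import requests

# Create a cluster that runs with the Glue Catalog instance profile and the
# Glue metastore switch already set (all bracketed values are placeholders)
resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "cluster_name": "glue-catalog-cluster",
        "spark_version": "<runtime-version>",
        "node_type_id": "<node-type>",
        "num_workers": 1,
        "aws_attributes": {
            "instance_profile_arn":
                "arn:aws:iam::<aws-account-id>:instance-profile/glue-catalog-profile"},
        "spark_conf": {
            "spark.databricks.hive.metastore.glueCatalog.enabled": "true"}})
resp.raise_for_status()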

Step7: Configure Glue Data Catalog as the metastore

To enable the Glue Catalog integration, set spark.databricks.hive.metastore.glueCatalog.enabled to true. This configuration is disabled by default; that is, the default is the Databricks-hosted Hive metastore, or some other external metastore if configured.

For interactive or jobs clusters, set the configuration in the cluster configuration before cluster startup.
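
In the cluster's Spark config box, the entry is a single key-value pair per line:

spark.databricks.hive.metastore.glueCatalog.enabled true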


The cluster is now up & running.


Now let's check whether we can access the Glue Catalog using the command below.

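The exact command is in the screenshot; a typical check from a Python notebook cell looks like this:

# List the databases visible through the Glue Catalog metastore
spark.sql("SHOW DATABASES").show()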

Let's check in the Glue Console whether we have the same databases. (We know we do, because we created "myfirstdatabase" in our first step.)


Wow!! This is what we were waiting for. The setup is done. We have configured the external catalog for Databricks. Amazing, isn't it?

Now let's create a database & table from Databricks & check whether they are reflected in the Glue Catalog.

All the steps below need to be done in a Databricks notebook.

Step1: We'll create a database called glue_db & register it to an S3 bucket (I have created a bucket - s3bucket1905).

Set up the AWS keys (ideally we should do this via an instance profile or a secret scope, but for now I am directly providing the AWS Access Key & AWS Secret Key).

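A minimal sketch of those two cells, assuming a Python notebook (the keys are placeholders; as noted above, an instance profile or secret scope is the better practice):

# Hand the AWS credentials to the S3A filesystem (demo only - do not
# hard-code keys in real notebooks)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<aws-access-key>")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<aws-secret-key>")

# Create the database in the Glue Catalog, backed by our S3 bucket
spark.sql("CREATE DATABASE IF NOT EXISTS glue_db LOCATION 's3a://s3bucket1905/glue_db/'")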


Step2: Read From File & Write into a Table

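The cell is in the screenshot; a sketch of the idea (the source path, format, & options are assumptions):

# Read a source file into a DataFrame (path & format here are hypothetical)
df = spark.read.option("header", "true").csv("s3a://s3bucket1905/input/pages.csv")
df.show(5)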

Step3: When using "saveAsTable" with Glue, we'll write our data as Parquet to the S3 bucket specified when we registered our database.

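A one-line sketch of that write; the table name "page" comes from the next step, & the glue_db prefix routes the files to the database's S3 location:

# Writes Parquet files under the database location (s3a://s3bucket1905/glue_db/)
# & registers the table "page" in the Glue Catalog
df.write.format("parquet").saveAsTable("glue_db.page")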


Step4: Now let's check the S3 bucket for the Parquet files backing our table "page".

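This can also be checked from the notebook, assuming the files follow the database-location convention above:

# List the Parquet files backing the table (the exact path is an assumption)
display(dbutils.fs.ls("s3a://s3bucket1905/glue_db/page/"))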

Step5: One final step - check in the Databricks Data tab whether the table is reflected, & in AWS Glue whether the newly created database shows up alongside the previous 2 databases.

In Databricks Data Tab:



In AWS Glue:



Wow!! That was quite some work - lots of settings needed to be configured in AWS & Databricks to make it all come together. From here on we can easily query tables created by Databricks from other AWS services such as AWS Athena, without even involving a Databricks cluster.

That's it. We are done.

This marks the end of this article. I hope I was able to give you something new to learn. Thanks for reading; please provide your feedback in the comment section, & please like & share if you liked the content.

Thanks !! Happy Monday, Happy Learning !!

Abhishek Trivedi

Streaming Data Engineer @ Disney| 18K | ex-Hopscotch | DATABRICKS | SCALA | SPARK | AWS Cloud | STREAMING | AIRFLOW | KINESIS

2y

Very detailed and informative. Bravo!
Hesham Bataineh

Machine Learning | Data engineering & Software Development

2y

Hello, I'm trying to configure the Glue Catalog with Databricks, following the same steps indicated in your article. Something odd happens when I paste the created instance profile into Databricks - it fails at validation: "Verification of the instance profile failed. AWS error: You are not authorized to perform this operation." I've followed the steps using my AWS root account; any solution?
Ashok G.

Azure Data Architect | 2k+ Subscriber Youtuber | learning Enthusiasts

2y

Hi Deepak, do we have any similar feature for Azure as well?
Sunil Kumar Chinni

Data Architect/Engineer| Batch & Streaming | AWS, Kafka, Python, Kubernetes, Docker, Spark, Data Modelling, Cloud Database(Redshift, Snowflake), ETL Tools

3y

Thanks for the article Deepak Rajak. Have you seen any performance issues while querying? If yes, any resolution…
Aruna Bhamidipati

Data Quality Specialist at HSBC

3y

Thank you Deepak. Was trying with the Databricks documentation, but could not get this working. Your post solved the issue. Good tutorial.
