AWS Glue Data Catalog as the Metastore for Databricks
Deepak Rajak
Data Engineering / Advanced Analytics Technical Delivery Lead at Exusia, Inc.
We can configure Databricks Runtime to use the AWS Glue Data Catalog as its metastore. This can serve as a drop-in replacement for a Hive metastore.
Each AWS account owns a single catalog in an AWS region whose catalog ID is the same as the AWS account ID. Using the Glue Catalog as the metastore for Databricks can potentially enable a shared metastore across AWS services, applications, or AWS accounts.
We can configure multiple Databricks workspaces to share the same metastore.
This article shows how to securely access a Glue Data Catalog in Databricks using instance profiles.
Let's Create a Database in AWS Glue
Step1: Log in to the AWS account & navigate to the AWS Glue service
Step2: Click on Databases & add a database (I named it myfirstdatabase)
Our database is created.
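If you prefer scripting over the console, here is an equivalent boto3 sketch (it assumes your AWS credentials & region are already configured):

import boto3

# Equivalent to clicking "Add database" in the Glue console.
glue = boto3.client("glue")
glue.create_database(DatabaseInput={"Name": "myfirstdatabase"})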
Now the idea is to populate & access this database from Databricks. Exciting? Let's work on it.
Step1: Create an instance profile to access a Glue Data Catalog
Create (or reuse) an IAM role with an attached instance profile, & add the following inline policy to the role so that clusters using it can reach the Glue Catalog. (Step2 of the underlying Databricks guide, a resource policy on the target Glue Catalog itself, applies only to cross-account access & is skipped here.)
"Version": "2012-10-17",
"Statement": [
{
"Sid": "GrantCatalogAccessToGlue",
"Effect": "Allow",
"Action": [
"glue:BatchCreatePartition",
"glue:BatchDeletePartition",
"glue:BatchGetPartition",
"glue:CreateDatabase",
"glue:CreateTable",
"glue:CreateUserDefinedFunction",
"glue:DeleteDatabase",
"glue:DeletePartition",
"glue:DeleteTable",
"glue:DeleteUserDefinedFunction",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetPartition",
"glue:GetPartitions",
"glue:GetTable",
"glue:GetTables",
"glue:GetUserDefinedFunction",
"glue:GetUserDefinedFunctions",
"glue:UpdateDatabase",
"glue:UpdatePartition",
"glue:UpdateTable",
"glue:UpdateUserDefinedFunction"
],
"Resource": [
"*"
]
}
]
}
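If you'd rather script this step too, here is a boto3 sketch; the role & profile names are hypothetical, & the policy is the one shown above (truncated in the snippet for brevity):

import json
import boto3

iam = boto3.client("iam")

# Hypothetical name; reuse your existing Databricks S3 instance profile role if you have one.
role_name = "databricks-glue-access"

# Trust policy so the Databricks cluster EC2 instances can assume the role.
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# The Glue policy from above; paste the full action list, truncated here.
glue_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "GrantCatalogAccessToGlue",
        "Effect": "Allow",
        "Action": ["glue:GetDatabase", "glue:GetDatabases",
                   "glue:GetTable", "glue:GetTables"],  # ...plus the rest
        "Resource": ["*"],
    }],
}

iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=json.dumps(trust))
iam.put_role_policy(RoleName=role_name, PolicyName="GrantCatalogAccessToGlue",
                    PolicyDocument=json.dumps(glue_policy))

# Wrap the role in an instance profile that clusters can use.
iam.create_instance_profile(InstanceProfileName=role_name)
iam.add_role_to_instance_profile(InstanceProfileName=role_name, RoleName=role_name)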
Step3: Look up the IAM role used to create the Databricks deployment (needs to be done in the Databricks account console)
Step4: Add the Glue Catalog instance profile to the EC2 policy (needs to be done in the AWS account). The updated policy keeps the usual EC2 permissions & adds an iam:PassRole statement for the Glue access role:
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Stmt1403287045000",
"Effect": "Allow",
"Action": [
"ec2:AssociateDhcpOptions",
"ec2:AssociateIamInstanceProfile",
"ec2:AssociateRouteTable",
"ec2:AttachInternetGateway",
"ec2:AttachVolume",
"ec2:AuthorizeSecurityGroupEgress",
"ec2:AuthorizeSecurityGroupIngress",
"ec2:CancelSpotInstanceRequests",
"ec2:CreateDhcpOptions",
"ec2:CreateInternetGateway",
"ec2:CreatePlacementGroup",
"ec2:CreateRoute",
"ec2:CreateSecurityGroup",
"ec2:CreateSubnet",
"ec2:CreateTags",
"ec2:CreateVolume",
"ec2:CreateVpc",
"ec2:CreateVpcPeeringConnection",
"ec2:DeleteInternetGateway",
"ec2:DeletePlacementGroup",
"ec2:DeleteRoute",
"ec2:DeleteRouteTable",
"ec2:DeleteSecurityGroup",
"ec2:DeleteSubnet",
"ec2:DeleteTags",
"ec2:DeleteVolume",
"ec2:DeleteVpc",
"ec2:DescribeAvailabilityZones",
"ec2:DescribeIamInstanceProfileAssociations",
"ec2:DescribeInstanceStatus",
"ec2:DescribeInstances",
"ec2:DescribePlacementGroups",
"ec2:DescribePrefixLists",
"ec2:DescribeReservedInstancesOfferings",
"ec2:DescribeRouteTables",
"ec2:DescribeSecurityGroups",
"ec2:DescribeSpotInstanceRequests",
"ec2:DescribeSpotPriceHistory",
"ec2:DescribeSubnets",
"ec2:DescribeVolumes",
"ec2:DescribeVpcs",
"ec2:DetachInternetGateway",
"ec2:DisassociateIamInstanceProfile",
"ec2:ModifyVpcAttribute",
"ec2:ReplaceIamInstanceProfileAssociation",
"ec2:RequestSpotInstances",
"ec2:RevokeSecurityGroupEgress",
"ec2:RevokeSecurityGroupIngress",
"ec2:RunInstances",
"ec2:TerminateInstances"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": "iam:PassRole",
"Resource": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-glue-access>"
}
]
}
Step5: Add the Glue Catalog instance profile to a Databricks workspace (needs to be done in the Databricks workspace)
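Besides the admin console UI, the workspace's Instance Profiles API can register the profile. A sketch, where the host, token & ARN are placeholders for your own values:

import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<databricks-personal-access-token>"
arn = "arn:aws:iam::<aws-account-id>:instance-profile/<glue-access-profile>"

resp = requests.post(f"{host}/api/2.0/instance-profiles/add",
                     headers={"Authorization": f"Bearer {token}"},
                     json={"instance_profile_arn": arn})
resp.raise_for_status()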
Step6: Launch a cluster with the Glue Catalog instance profile
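This is normally done from the cluster creation UI by picking the instance profile under Advanced Options. For completeness, a sketch against the Clusters API (all values are placeholders), which also pre-sets the Step7 config:

import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<databricks-personal-access-token>"

cluster_spec = {
    "cluster_name": "glue-catalog-demo",
    "spark_version": "<runtime-version>",  # list options via /api/2.0/clusters/spark-versions
    "node_type_id": "<node-type>",
    "num_workers": 1,
    "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::<aws-account-id>:instance-profile/<glue-access-profile>"
    },
    # Step7's setting, applied before startup as required:
    "spark_conf": {"spark.databricks.hive.metastore.glueCatalog.enabled": "true"},
}

resp = requests.post(f"{host}/api/2.0/clusters/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=cluster_spec)
resp.raise_for_status()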
Step7: Configure Glue Data Catalog as the metastore
To enable Glue Catalog integration, set the Spark configuration spark.databricks.hive.metastore.glueCatalog.enabled to true. This configuration is disabled by default; that is, the default is to use the Databricks-hosted Hive metastore, or some other external metastore if configured.
For interactive or jobs clusters, set the configuration in the cluster configuration before cluster startup.
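In the cluster's Spark config box (under Advanced Options), that is this single line:

spark.databricks.hive.metastore.glueCatalog.enabled true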
The cluster is now up & running.
Now let's check if we can access the Glue Catalog using the command below.
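For example, from a notebook cell:

spark.sql("SHOW DATABASES").show()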
Let's check in the Glue console whether we have the same databases. (We know we do, because we created "myfirstdatabase" in our first step.)
Wow!! This is what we were waiting for. The setup is done: we have configured an external catalog for Databricks. Amazing, isn't it?
Now let's create a database & table from Databricks & check whether they show up in the Glue Catalog.
All the steps below need to be done in a Databricks notebook.
Step1: We'll create a database called glue_db. We'll register it to an S3 bucket (I have created a bucket - s3bucket1905).
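A sketch of the DDL; the glue_db/ prefix inside the bucket is an assumption for illustration:

# Register the new database against our bucket; tables created in it will live under this location.
spark.sql("CREATE DATABASE IF NOT EXISTS glue_db LOCATION 's3a://s3bucket1905/glue_db/'")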
Set up AWS keys (ideally we should do this via the instance profile or a secret scope, but for now I am directly providing the AWS Access Key & AWS Secret Key).
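A minimal sketch with placeholder keys (never hard-code real credentials):

# Demo only: prefer the instance profile or dbutils.secrets.get(...) in real projects.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<aws-access-key>")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<aws-secret-key>")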
Step2: Read From File & Write into a Table
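A sketch of the read; the source path & file format are assumptions, not the exact file from the article:

# Read the raw file(s) from S3 into a DataFrame.
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("s3a://s3bucket1905/input/"))  # hypothetical input location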
Step3: When using "saveAsTable" with Glue, we'll write our data as parquet to the S3 bucket specified when we registered our database.
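Continuing the sketch, with the table name "page" taken from the next step:

# saveAsTable writes parquet by default & registers "page" in the Glue-backed metastore.
df.write.mode("overwrite").saveAsTable("glue_db.page")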
Step4: Now let's check the S3 bucket for the "parquet" files backing our table "page".
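One way to peek at those files from the notebook (the path follows from the database location assumed above):

display(dbutils.fs.ls("s3a://s3bucket1905/glue_db/page/"))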
Step5: One final step: check in the Databricks Data tab whether the table is reflected, & in AWS Glue whether the newly created database shows up alongside the previous 2 databases.
In Databricks Data Tab:
In AWS Glue:
Wow!! That's quite a bit of work. Lots of settings needed to be done in AWS & Databricks to make it work. From here on, though, we can easily query the tables created by Databricks from different AWS services like AWS Athena, without even involving a Databricks cluster.
That's it. We are done.
This marks the end of this article. I hope I was able to give you something new to learn. Thanks for reading! Please provide your feedback in the comments section, & please like & share if you liked the content.
Thanks!! Happy Monday, Happy Learning!!