Create Tables in Databricks & Query it from AWS Athena
Deepak Rajak
Data Engineering /Advanced Analytics Technical Delivery Lead at Exusia, Inc.
In my last article, we integrated AWS Glue with Databricks as an external data catalog (metastore). Here is a link to that.
In continuation of that, we would now like to query the tables we created in Databricks from AWS Athena. This is easily possible because Athena uses the same Glue Catalog that we integrated with Databricks.
So let's get started.
Step1: Log in to the AWS console & navigate to the Athena service.
Step2: Select the database we created from Databricks - "new_db" - & write a select query on our table "page".
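Since Athena reads the table definition straight from the shared Glue Catalog, a simple query like the following can be run in the Athena query editor (no extra registration is needed for the Parquet-backed table):

```sql
-- Query the Databricks-created table through the shared Glue Catalog
SELECT *
FROM new_db.page
LIMIT 10;
```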
Now, this one is pretty straightforward because we created the table backed by Parquet files. What if we create a Delta Lake table? There is a slightly different way of querying Delta Lake tables from AWS Athena.
Let's do that step by step.
Step1: Register Delta Table for Use by Databricks
When we register a Delta Table against these files, Databricks will maintain the current valid version of the files using the Glue Catalog.
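In a Databricks notebook, registering the Delta table might look like the sketch below. The S3 path is a placeholder (your Delta files may live elsewhere); the table name matches the "departuredelay" table used in the later steps:

```sql
-- Register a Delta table in the Glue Catalog, backed by existing Delta files
CREATE TABLE new_db.departuredelay
USING DELTA
LOCATION 's3://<your-bucket>/delta/departuredelay/';
```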
Step2: Generate manifests from a Delta Lake table using Databricks Runtime
Athena requires manifest files to be generated in order to query a Delta table from the Glue Catalog.
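From a notebook running on Databricks Runtime, the manifest can be generated with a single SQL command:

```sql
-- Writes a _symlink_format_manifest/ directory under the table's root location,
-- listing the Parquet files that make up the current valid version of the table
GENERATE symlink_format_manifest FOR TABLE new_db.departuredelay;
```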
Step3: Configure Athena to read generated manifests
You must supply the schema in DDL.
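A sketch of the Athena DDL, following the standard Delta Lake symlink-manifest integration pattern. The column list here is illustrative only - supply your table's actual schema - and the S3 path is a placeholder that must point at the generated manifest directory:

```sql
-- Manifest-backed external table for Athena; the SymlinkTextInputFormat reads
-- the manifest files, which in turn point to the valid Parquet data files
CREATE EXTERNAL TABLE new_db.departuredelay_athena (
  `date` STRING,
  delay INT,
  distance INT,
  origin STRING,
  destination STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<your-bucket>/delta/departuredelay/_symlink_format_manifest/';
```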
Step4: Navigate to Athena & execute the query against "departuredelay" & "departuredelay_athena" table.
We will not get any data if we query "departuredelay" from Athena, because the underlying location contains a "_delta_log" directory and data files from every table version, which Athena is not able to parse. So we have to query "departuredelay_athena", since that is the table we registered with the manifest file.
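In Athena, the manifest-backed table is the one to query. For example (column names are illustrative, matching the assumed schema above):

```sql
-- Query the manifest-backed table; querying new_db.departuredelay directly
-- from Athena would return no usable results
SELECT origin, destination, AVG(delay) AS avg_delay
FROM new_db.departuredelay_athena
GROUP BY origin, destination
LIMIT 10;
```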
#Note ( Very Very Important )
Remember that any schema changes in the Delta table data will be visible to operations in Databricks using "departuredelay". However, the schema changes will not be visible to Athena using "departuredelay_athena" until the table is redefined with the new schema.
One final question - Did we just duplicate our data?
No
Athena cannot currently utilize the metadata contained in the Delta logs to query the most recent state of the data. The manifest is generated from these log files, and ensures that Athena queries the valid state.
Databricks cannot query the data based on the table registered with the manifest.
Our data is not duplicated, we have just defined separate metadata to reflect the current valid state of our data. All queries and operations from either service should be directed toward their respective tables.
That's it. We are done.
This marks the end of this article. I hope I was able to give you something new to learn. Thanks for reading! Please provide your feedback in the comments section, and please like & share if you enjoyed the content.
Thanks !! Happy Monday, Happy Learning !!