Create Tables in Databricks & Query it from AWS Athena

In my last article, we integrated AWS Glue with Databricks as an external data catalog (metastore). Here is a link to that.

In continuation of that, we would like to query the tables we created in Databricks from AWS Athena. This is easily possible because the Athena service uses the same Glue Data Catalog that we integrated with Databricks.

So let's get started.

Step 1: Log in to the AWS console & navigate to the Athena service.

Step 2: Select the database we created from Databricks, "new_db", and write a SELECT query on our table "page".

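As a rough sketch, the query could look something like the following (only the database and table names come from the steps above; the column selection and LIMIT are purely illustrative):

  -- Athena query editor: the database and table created from Databricks are
  -- already visible here, because Athena reads the same Glue Data Catalog.
  SELECT *
  FROM new_db.page
  LIMIT 10;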

Now, this one is pretty straightforward because we created the table backed by Parquet files. What if we create a Delta Lake table? There is a slightly different way of querying Delta Lake tables from AWS Athena.

Let's do that step by step.

Step 1: Register the Delta table for use by Databricks

When we register a Delta table on top of the underlying Delta files, Databricks will maintain the current valid version of the files using the Glue Catalog.

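A minimal sketch of this registration (the S3 path is a hypothetical placeholder, not the actual location used here):

  -- Databricks notebook: register a Delta table in the Glue Catalog on top of
  -- existing Delta files. Replace the LOCATION with your own S3 path.
  CREATE TABLE departuredelay
  USING DELTA
  LOCATION 's3://<your-bucket>/delta/departuredelay/';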

Step 2: Generate manifests from the Delta Lake table using Databricks Runtime

Athena requires manifest files to be generated in order to query a Delta table from the Glue Catalog.

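In SQL this is a one-liner (Delta Lake also exposes the same operation through the DeltaTable generate API in Python/Scala). A sketch:

  -- Databricks notebook: generate manifest files for the Delta table.
  -- They are written under <table-path>/_symlink_format_manifest/.
  GENERATE symlink_format_manifest FOR TABLE departuredelay;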

Step 3: Configure Athena to read the generated manifests

You must supply the schema in DDL.

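A sketch of the Athena DDL, following the standard Delta Lake manifest pattern; the column list below is hypothetical, so replace it with the actual schema of your Delta table, and the S3 path is again a placeholder:

  -- Athena query editor: define an external table over the manifest location,
  -- not over the data location itself.
  CREATE EXTERNAL TABLE departuredelay_athena (
    flightdate STRING,
    delay INT,
    distance INT,
    origin STRING,
    destination STRING
  )
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 's3://<your-bucket>/delta/departuredelay/_symlink_format_manifest/';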

Step 4: Navigate to Athena & execute queries against the "departuredelay" and "departuredelay_athena" tables.

We will not get any data if we query "departuredelay", because the underlying location contains the "_delta_log" directory, which Athena is not able to parse. So we have to query "departuredelay_athena", because we registered that table against the manifest files.

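To make the contrast concrete, a pair of illustrative queries:

  -- Queried from Athena, this returns no data: Athena cannot interpret
  -- the _delta_log directory under the data location.
  SELECT COUNT(*) FROM departuredelay;

  -- This returns the current valid state of the table via the manifest files.
  SELECT COUNT(*) FROM departuredelay_athena;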

Note (Very, Very Important)

  • departuredelay: Defined on the data location. All read and write operations in Databricks must use this table. Athena cannot use this table for any query.
  • departuredelay_athena: Defined on the manifest location. All read operations from Athena must use this table. Databricks cannot use this table for any operations.

Remember that any schema changes in the Delta table data will be visible to operations in Databricks using departuredelay. However, the schema changes will not be visible to Athena using departuredelay_athena until the table is redefined with the new schema.

One final question - Did we just duplicate our data?

No

Athena cannot currently utilize the metadata contained in the Delta logs to query the most recent state of the data. The manifest is generated from these log files, and ensures that Athena queries the valid state.

Databricks cannot query the data based on the table registered with the manifest.

Our data is not duplicated; we have just defined separate metadata to reflect the current valid state of our data. All queries and operations from either service should be directed toward their respective tables.

That's it. We are done.

This marks the end of this article. I hope I was able to provide you with something new to learn. Thanks for reading; please provide your feedback in the comment section, and please like & share if you liked the content.

Thanks !! Happy Monday, Happy Learning !!


hieu tran

I think we need to extend the guide a little when working with partitioned data, following the document from delta.io (https://docs.delta.io/0.5.0/presto-integration.html#step-2-configure-presto-or-athena-to-read-the-generated-manifests). Thanks Deepak.

Praveen D'sa

Very informative, thanks Deepak.
