Create Tables in Databricks & Query it from AWS Athena

In my last article, we integrated AWS Glue with Databricks as an external data catalog (metastore). Here is a link to that.

In continuation of that, we would like to query the tables we created in Databricks from AWS Athena. This is easily possible because the Athena service uses the same Glue Data Catalog that we integrated with Databricks.

So let's get started.

Step 1: Log in to the AWS console & navigate to the Athena service.

Step 2: Select the database we created from Databricks, "new_db", and write a SELECT query on our table "page".

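As a rough sketch, the query could look something like the following (only the database and table names come from the steps above; the column selection and LIMIT are purely illustrative):

  -- Athena query editor: the database and table created from Databricks are
  -- already visible here, because Athena reads the same Glue Data Catalog.
  SELECT *
  FROM new_db.page
  LIMIT 10;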

Now, this one is pretty straightforward because we created the table backed by Parquet files. What if we create a Delta Lake table? There is a slightly different way of querying Delta Lake tables from AWS Athena.

Let's do that step by step.

Step 1: Register the Delta table for use by Databricks

When we register a Delta table on top of the underlying Delta files, Databricks will maintain the current valid version of the files using the Glue Catalog.

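A minimal sketch of this registration (the S3 path is a hypothetical placeholder, not the actual location used here):

  -- Databricks notebook: register a Delta table in the Glue Catalog on top of
  -- existing Delta files. Replace the LOCATION with your own S3 path.
  CREATE TABLE departuredelay
  USING DELTA
  LOCATION 's3://<your-bucket>/delta/departuredelay/';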

Step 2: Generate manifests from the Delta Lake table using Databricks Runtime

Athena requires manifest files to be generated in order to query a Delta table from the Glue Catalog.

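In SQL this is a one-liner (Delta Lake also exposes the same operation through the DeltaTable generate API in Python/Scala). A sketch:

  -- Databricks notebook: generate manifest files for the Delta table.
  -- They are written under <table-path>/_symlink_format_manifest/.
  GENERATE symlink_format_manifest FOR TABLE departuredelay;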

Step 3: Configure Athena to read the generated manifests

You must supply the schema in DDL.

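A sketch of the Athena DDL, following the standard Delta Lake manifest pattern; the column list below is hypothetical, so replace it with the actual schema of your Delta table, and the S3 path is again a placeholder:

  -- Athena query editor: define an external table over the manifest location,
  -- not over the data location itself.
  CREATE EXTERNAL TABLE departuredelay_athena (
    flightdate STRING,
    delay INT,
    distance INT,
    origin STRING,
    destination STRING
  )
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 's3://<your-bucket>/delta/departuredelay/_symlink_format_manifest/';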

Step 4: Navigate to Athena & execute queries against the "departuredelay" and "departuredelay_athena" tables.

We will not get any data if we query "departuredelay", because the underlying location contains the "_delta_log" directory, which Athena is not able to parse. So we have to query "departuredelay_athena", because we registered that table against the manifest files.

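To make the contrast concrete, a pair of illustrative queries:

  -- Queried from Athena, this returns no data: Athena cannot interpret
  -- the _delta_log directory under the data location.
  SELECT COUNT(*) FROM departuredelay;

  -- This returns the current valid state of the table via the manifest files.
  SELECT COUNT(*) FROM departuredelay_athena;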

Note (Very, Very Important)

  • departuredelay: Defined on the data location. All read and write operations in Databricks must use this table. Athena cannot use this table for any query.
  • departuredelay_athena: Defined on the manifest location. All read operations from Athena must use this table. Databricks cannot use this table for any operations.

Remember that any schema changes in the Delta table data will be visible to operations in Databricks using departuredelay. However, the schema changes will not be visible to Athena using departuredelay_athena until the table is redefined with the new schema.

One final question - Did we just duplicate our data?

No

Athena cannot currently utilize the metadata contained in the Delta logs to query the most recent state of the data. The manifest is generated from these log files, and ensures that Athena queries the valid state.

Databricks cannot query the data based on the table registered with the manifest.

Our data is not duplicated; we have just defined separate metadata to reflect the current valid state of our data. All queries and operations from either service should be directed toward their respective tables.

That's it. We are done.

This marks the end of this article. I hope I was able to provide you with something new to learn. Thanks for reading; please provide your feedback in the comment section, and please like & share if you liked the content.

Thanks !! Happy Monday, Happy Learning !!


hieu tran

I think we need to extend the guide a little when working with partitioned data, following the document from delta.io (https://docs.delta.io/0.5.0/presto-integration.html#step-2-configure-presto-or-athena-to-read-the-generated-manifests). Thanks Deepak.

Praveen D'sa

Very informative, thanks Deepak.
