Unloading from AWS S3 & loading into Azure Synapse Table via Databricks
Deepak Rajak
Data Engineering /Advanced Analytics Technical Delivery Lead at Exusia, Inc.
Here is a very common scenario. AWS was the first public cloud, and a lot of people started there, in particular with the Simple Storage Service (S3). Now, if we want to move this data from AWS to Azure, we have quite a number of storage options in Microsoft Azure, such as Azure Blob Storage, Azure SQL Database, Azure Synapse Analytics, etc.
Today, we will learn how to move data from AWS S3 into an Azure Synapse Analytics table.
We can achieve this in various ways, such as Azure Data Factory, PolyBase, or any other third-party ETL / data integration tool, but I will use Databricks. The simple reason: I love Databricks :). Databricks is simple and fast, but a little expensive (the cost is mitigated by its excellent performance).
Anyway, let's get started with today's agenda, step by step.
Step 1: Where is our data? We already have an S3 bucket created, and we have uploaded some files into it.
Our Bucket Name - databricks1905
We will read the file - pageviews_by_second_example.tsv
Let's jump into the Databricks workspace now.
Step 2: Mount this S3 bucket (databricks1905) on DBFS (Databricks File System).
Here is the link to my article on mounting an S3 bucket in Databricks.
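For reference, the mount typically looks something like the sketch below. The access key, secret key, and mount point name here are placeholders, not the values from my setup; in practice you would keep the keys in a secret scope.

# Minimal sketch: mount the S3 bucket on DBFS (placeholder credentials)
ACCESS_KEY = "<aws-access-key>"
SECRET_KEY = "<aws-secret-key>"
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")  # URL-encode any "/" in the secret
AWS_BUCKET_NAME = "databricks1905"
MOUNT_NAME = "s3data"                                # assumed mount point name

dbutils.fs.mount(
    source = f"s3a://{ACCESS_KEY}:{ENCODED_SECRET_KEY}@{AWS_BUCKET_NAME}",
    mount_point = f"/mnt/{MOUNT_NAME}"
)

# Verify the mount by listing the bucket contents
display(dbutils.fs.ls(f"/mnt/{MOUNT_NAME}"))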
Step 3: Read the file and create the DataFrame.
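A minimal sketch of this step, assuming the file has a header row and tab-separated columns, and that the bucket is mounted at /mnt/s3data as above:

# Read the TSV file from the mount point into a DataFrame
df = (spark.read
        .option("header", "true")
        .option("sep", "\t")
        .option("inferSchema", "true")
        .csv("/mnt/s3data/pageviews_by_second_example.tsv"))

display(df)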
Step 4: Synapse Analytics has a storage account associated with it; you can note it when you create the Synapse workspace. We need the storage account name and access keys to connect to the Synapse table: the connector stages the data in this storage temporarily and then issues a COPY command from the temp storage into the table. Just navigate to the storage account and copy the access keys.
Note: Passing credentials like this is not standard practice. Azure and Databricks offer various ways to manage secrets such as storage keys and SQL passwords, but for learning purposes I am demonstrating by passing them in the notebook itself.
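As a rough sketch (the storage account name "mystorageacct" and the key below are placeholders), making the staging storage key available to the session looks like this:

# Make the staging storage account key available to the Synapse connector.
# For an ADLS Gen2 account the endpoint would be dfs.core.windows.net instead of blob.core.windows.net.
spark.conf.set(
    "fs.azure.account.key.mystorageacct.blob.core.windows.net",
    "<storage-account-access-key>"
)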
Step 5: A little bit of data engineering. We rename the column "timestamp" to "recordedAt" and add another column, "loadDate", that contains the current date.
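A small sketch of this transformation, using the column names from the step above:

from pyspark.sql.functions import current_date

# Rename "timestamp" to "recordedAt" and add a "loadDate" column with today's date
transformedDF = (df
    .withColumnRenamed("timestamp", "recordedAt")
    .withColumn("loadDate", current_date()))

display(transformedDF)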
Step 6: Let's check a few basic things in Synapse Analytics.
We already have the dedicated SQL pool running (see below); I created it earlier.
We need the complete JDBC URL for the dedicated pool. Navigate to the Azure portal and copy the full connection string.
Note: Don't forget to fill in your password after copying it.
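The connection string generally has the shape below; every value here is a placeholder you replace with your own workspace, pool, user, and password:

# Illustrative JDBC URL for a Synapse dedicated SQL pool (placeholders only)
jdbcUrl = ("jdbc:sqlserver://<workspace-name>.sql.azuresynapse.net:1433;"
           "database=<dedicated-pool-name>;"
           "user=<sql-admin-user>@<workspace-name>;"
           "password=<your-password>;"
           "encrypt=true;trustServerCertificate=false;"
           "hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;")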
Step 7: Set up the Synapse table load command.
Note: The table does not yet exist in Synapse Analytics; Spark (Databricks) will create it using the DataFrame schema.
The table name is "pageview".
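A sketch of the load command using the Azure Synapse connector; the tempDir container and storage account are placeholders, and jdbcUrl is the string built in Step 6. The connector stages the DataFrame under tempDir and lets Synapse load it from there.

# Write the DataFrame to the Synapse table "pageview"
(transformedDF.write
    .format("com.databricks.spark.sqldw")
    .option("url", jdbcUrl)
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "pageview")
    .option("tempDir", "wasbs://staging@mystorageacct.blob.core.windows.net/pageview")
    .mode("overwrite")
    .save())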
Step 8: Let's check in Synapse by querying the table "pageview".
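For example, in the Synapse SQL editor a quick sanity check could look like this (the dbo schema is an assumption):

-- Preview a few rows and confirm the row count
SELECT TOP 10 * FROM dbo.pageview;
SELECT COUNT(*) AS row_count FROM dbo.pageview;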
So you have seen how quickly we moved our data from AWS S3 to an Azure Synapse Analytics table, in just a few steps.
Synapse Analytics is a very powerful tool. The dedicated SQL pool is only one of the options it offers; it also integrates with Data Factory, Power BI, and Azure Purview.
Also, don't forget it comes with both a SQL engine and a Spark engine. Maybe I will share more insights about it in the future.
This marks the end of this article. I hope I was able to give you something new to learn. Thanks for reading; please share your feedback in the comments, and like and share the article if you enjoyed the content.
Thanks !! Happy Weekend, Happy Learning !!
Data Engineering Manager
Thanks for sharing, Deepak. Great detailed walkthrough. On the topic of choice of service: I noticed that the file you were extracting was 3.6 KB. That size would be a good fit for Azure Functions for speed and cost savings; depending on the plan, you could even use the Consumption plan, so it would essentially be free. I'd recommend that as well, since you only have a small transform. So many factors go into choosing which service to use. Anyway, good article.
Data Engineering @ Microsoft
Thanks. Instead of reading from a mount point, I want to read directly from S3 and write to Synapse, i.e. use an AWS access key and secret key for reading, and an Azure storage account key for authentication while writing. Is that possible?