Unloading from AWS S3 & loading into Azure Synapse Table via Databricks

Here is a very common scenario. AWS was the first major public cloud, and a lot of people started with AWS, in particular its Simple Storage Service (S3). Now, if we want to move this data from AWS to Azure, we have quite a number of destination options in Microsoft Azure, like Azure Blob Storage, Azure SQL Database, Azure Synapse Analytics, etc.

Today, we will learn how to move this data from Amazon Web Services S3 into an Azure Synapse Analytics table.

We can achieve this in various ways, like Azure Data Factory, PolyBase, or any other third-party ETL / data integration tool, but I will use Databricks for it. The simple reason: I love Databricks :). Databricks is simple and fast, but a little bit expensive (the expense gets mitigated by its excellent performance).

Anyway, let's get started with today's agenda, step by step.

Step 1: Where is our data? We already have an S3 bucket created, and we have uploaded some files into it.

Our Bucket Name - databricks1905

We will read the file - pageviews_by_second_example.tsv


Let's jump into the Databricks workspace now.

Step 2: Mount this S3 bucket (databricks1905) on DBFS (Databricks File System).

Here is the link to my earlier article on mounting an S3 bucket into Databricks.
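
That article covers the mount in detail; as a quick reference, here is a minimal sketch, assuming the AWS credentials live in a hypothetical Databricks secret scope named "aws-keys" (the scope, key names and mount point are placeholders, not from the original notebook):

```python
# Minimal sketch: mount the S3 bucket on DBFS.
# "aws-keys" and the key names are hypothetical placeholders for your own secrets.
access_key = dbutils.secrets.get(scope="aws-keys", key="aws-access-key")
secret_key = dbutils.secrets.get(scope="aws-keys", key="aws-secret-key")
encoded_secret_key = secret_key.replace("/", "%2F")  # URL-encode any "/" in the key

aws_bucket_name = "databricks1905"
mount_name = "s3data"

dbutils.fs.mount(
    source=f"s3a://{access_key}:{encoded_secret_key}@{aws_bucket_name}",
    mount_point=f"/mnt/{mount_name}",
)
display(dbutils.fs.ls(f"/mnt/{mount_name}"))
```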

Step 3: Read the file & create the DataFrame.

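The notebook cell itself is not reproduced here; a minimal sketch of the read, assuming the bucket is mounted at /mnt/s3data (as in the sketch above) and the TSV file has a header row:

```python
# Read the TSV file from the mounted bucket into a Spark DataFrame.
df = (
    spark.read
    .option("header", "true")       # first line holds the column names
    .option("sep", "\t")            # tab-separated values
    .option("inferSchema", "true")  # let Spark infer the column types
    .csv("/mnt/s3data/pageviews_by_second_example.tsv")
)
display(df)
```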

Step 4: Synapse Analytics has a storage account associated with it; you can note it down when you create the Synapse workspace. We need this storage account's name & access key to connect to the Synapse table. (It serves as the temporary staging hop: Databricks writes the data there and then issues a COPY command from the temp storage into the table.) Just navigate to the storage account & get the access keys.

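A minimal sketch of passing that access key to the Spark session so the connector can write to the staging storage; "synapsestorage1905" is a hypothetical storage account name, not the one from the screenshots:

```python
# Give the Spark session access to the Synapse-associated (staging) storage account.
# Replace the account name and key with your own values, ideally read from a secret scope.
spark.conf.set(
    "fs.azure.account.key.synapsestorage1905.dfs.core.windows.net",
    "<storage-account-access-key>",
)
```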

Note: Passing credentials like this is not standard practice. There are various ways in Azure & Databricks to manage secrets such as storage keys, SQL passwords, etc. But for our learning purpose, I am just demonstrating it by passing them in the notebook itself.

Step 5: A little bit of data engineering. We rename the column "timestamp" to "recordedAt" & add another column named "loadDate" that will contain the current date.

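A minimal sketch of this transformation, assuming the DataFrame from Step 3 is called df:

```python
from pyspark.sql.functions import current_date

# Rename "timestamp" to "recordedAt" and add a "loadDate" column with today's date.
df_transformed = (
    df.withColumnRenamed("timestamp", "recordedAt")
      .withColumn("loadDate", current_date())
)
display(df_transformed)
```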

Step 6: Let's check some basic things in Synapse Analytics.

We already have the dedicated SQL pool running; I created it beforehand.


We need the complete JDBC URL of the dedicated pool. Navigate to the Azure portal & copy the complete connection string.

Note: Don't forget to fill in your password after copying it.

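For reference, the connection string copied from the portal looks roughly like the sketch below; the workspace, pool, user and password values are placeholders:

```python
# JDBC connection string for the dedicated SQL pool, copied from the Azure portal.
# Everything in angle brackets is a placeholder.
jdbc_url = (
    "jdbc:sqlserver://<your-workspace>.sql.azuresynapse.net:1433;"
    "database=<your-dedicated-pool>;"
    "user=<sql-admin-user>@<your-workspace>;"
    "password=<your-password>;"
    "encrypt=true;trustServerCertificate=false;"
    "hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;"
)
```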

Step 7: Set up the Synapse table load command.

Note: The table does not yet exist in Synapse Analytics. Spark (Databricks) will create it using the DataFrame schema.

The table name is "pageview".

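A minimal sketch of the load command using the Databricks Synapse connector; the tempDir below points to a hypothetical "tempdata" container in the staging storage account from Step 4:

```python
# Write the transformed DataFrame into the Synapse dedicated pool.
# The connector stages the data in tempDir and then issues a COPY into the table.
(
    df_transformed.write
    .format("com.databricks.spark.sqldw")
    .option("url", jdbc_url)
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "pageview")
    .option("tempDir", "abfss://tempdata@synapsestorage1905.dfs.core.windows.net/staging")
    .mode("overwrite")
    .save()
)
```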

Step 8: Let's check in Synapse by querying the table "pageview".

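In Synapse Studio this is just a SELECT against the pageview table. As an optional sanity check from the Databricks side, here is a sketch that reads the table back through the same connector (same placeholder URL and tempDir as above):

```python
# Read the table back from Synapse to confirm the load.
df_check = (
    spark.read
    .format("com.databricks.spark.sqldw")
    .option("url", jdbc_url)
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "pageview")
    .option("tempDir", "abfss://tempdata@synapsestorage1905.dfs.core.windows.net/staging")
    .load()
)
print(df_check.count())
display(df_check.limit(10))
```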

So you have seen how quickly we moved our data from AWS S3 into an Azure Synapse Analytics table, in just a few steps.

Synapse Analytics is a very powerful tool. The dedicated SQL pool is just one of the options it offers. It also has integration with Data Factory, Power BI & Azure Purview.

Also, don't forget it comes with both a SQL engine & a Spark engine. Maybe I will share more insights about it in the future.

This marks the end of this article. I hope I was able to give you something new to learn. Thanks for reading! Please share your feedback in the comments section, and like & share if you enjoyed the content.

Thanks !! Happy Weekend, Happy Learning !!

Sundarraj T

Azure Data Engineer@TechM | Ex HCL | Ex Cognizant | PySpark | HDFS | Hive | Spark | Python | Sqoop | Azure Cloud | SQL | Hadoop Ecosystem

Thanks for sharing

Mohamed Aarif Ghouse

Programmer | SQL | PL/SQL | Data Warehousing| Hive|Azure ETL Developer| DataBricks | Snowflake at Hexaware Technologies

Very useful

Bilal B.

Data Engineering Manager

Thanks for sharing Deepak. Great detailed walkthrough. On the topic of choice of service, I noticed that the file you were extracting was 3.6 KB. That size would be a good fit for Azure Functions for speed and cost savings; depending on the plan, you could also use the Consumption plan, so it's essentially a free hit. I'd recommend that here as well since you have a small transform. So many factors go into choosing which service to use; anyway, good article.

Ramakrishna Gunimanikala

Data Engineering @ Microsoft

Thanks. Instead of reading from a mount point, I want to read from S3 and write to Synapse directly, i.e. use the AWS access key and secret key for reading, and the Azure storage account key for authentication while writing. Is that possible?
