Unloading from AWS S3 & loading into Azure Synapse Table via Databricks
Deepak Rajak
Data Engineering /Advanced Analytics Technical Delivery Lead at Exusia, Inc.
Here is a very common scenario. AWS was the first public cloud, and a lot of people started there, in particular with the Simple Storage Service (S3). Now, if we want to move this data from AWS to Azure, we have quite a number of storage options in Microsoft Azure, such as Azure Blob Storage, Azure SQL Database, Azure Synapse Analytics, etc.
Today, we will learn how to move data from AWS S3 into an Azure Synapse Analytics table.
We can achieve this in various ways, such as Azure Data Factory, PolyBase, or any other third-party ETL / data integration tool, but I will use Databricks. The simple reason: I love Databricks :). Databricks is simple and fast, but a little expensive (the cost is mitigated by its excellent performance).
Anyway, let's get started with today's agenda, step by step.
Step 1: Where is our data? We already have an S3 bucket created, and we have uploaded some files into it.
Our Bucket Name - databricks1905
We will read the file - pageviews_by_second_example.tsv
Let's jump into the Databricks workspace now.
Step 2: Mount this S3 bucket (databricks1905) on DBFS (Databricks File System).
Here is the link to my article on mounting an S3 bucket in Databricks.
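For reference, the mount typically looks something like the sketch below. The access key, secret key, and mount point name here are placeholders, not the values from my setup; in practice you would keep the keys in a secret scope.

# Minimal sketch: mount the S3 bucket on DBFS (placeholder credentials)
ACCESS_KEY = "<aws-access-key>"
SECRET_KEY = "<aws-secret-key>"
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")  # URL-encode any "/" in the secret
AWS_BUCKET_NAME = "databricks1905"
MOUNT_NAME = "s3data"                                # assumed mount point name

dbutils.fs.mount(
    source = f"s3a://{ACCESS_KEY}:{ENCODED_SECRET_KEY}@{AWS_BUCKET_NAME}",
    mount_point = f"/mnt/{MOUNT_NAME}"
)

# Verify the mount by listing the bucket contents
display(dbutils.fs.ls(f"/mnt/{MOUNT_NAME}"))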
Step 3: Read the file and create the DataFrame.
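A minimal sketch of this step, assuming the file has a header row and tab-separated columns, and that the bucket is mounted at /mnt/s3data as above:

# Read the TSV file from the mount point into a DataFrame
df = (spark.read
        .option("header", "true")
        .option("sep", "\t")
        .option("inferSchema", "true")
        .csv("/mnt/s3data/pageviews_by_second_example.tsv"))

display(df)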
Step 4: Synapse Analytics has a storage account associated with it; you can note it when you create the Synapse workspace. We need the storage account name and access keys to connect to the Synapse table: the connector stages the data in this storage temporarily and then issues a COPY command from the temp storage into the table. Just navigate to the storage account and copy the access keys.
Note: Passing credentials like this is not standard practice. Azure and Databricks offer various ways to manage secrets such as storage keys and SQL passwords, but for learning purposes I am demonstrating by passing them in the notebook itself.
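As a rough sketch (the storage account name "mystorageacct" and the key below are placeholders), making the staging storage key available to the session looks like this:

# Make the staging storage account key available to the Synapse connector.
# For an ADLS Gen2 account the endpoint would be dfs.core.windows.net instead of blob.core.windows.net.
spark.conf.set(
    "fs.azure.account.key.mystorageacct.blob.core.windows.net",
    "<storage-account-access-key>"
)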
Step 5: A little bit of data engineering. We rename the column "timestamp" to "recordedAt" and add another column, "loadDate", that contains the current date.
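A small sketch of this transformation, using the column names from the step above:

from pyspark.sql.functions import current_date

# Rename "timestamp" to "recordedAt" and add a "loadDate" column with today's date
transformedDF = (df
    .withColumnRenamed("timestamp", "recordedAt")
    .withColumn("loadDate", current_date()))

display(transformedDF)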
Step 6: Let's check a few basic things in Synapse Analytics.
We already have the dedicated SQL pool running (see below); I created it earlier.
We need the complete JDBC URL for the dedicated pool. Navigate to the Azure portal and copy the full connection string.
Note: Don't forget to fill in your password after copying it.
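The connection string generally has the shape below; every value here is a placeholder you replace with your own workspace, pool, user, and password:

# Illustrative JDBC URL for a Synapse dedicated SQL pool (placeholders only)
jdbcUrl = ("jdbc:sqlserver://<workspace-name>.sql.azuresynapse.net:1433;"
           "database=<dedicated-pool-name>;"
           "user=<sql-admin-user>@<workspace-name>;"
           "password=<your-password>;"
           "encrypt=true;trustServerCertificate=false;"
           "hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;")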
Step 7: Set up the Synapse table load command.
Note: The table does not yet exist in Synapse Analytics; Spark (Databricks) will create it using the DataFrame schema.
The table name is "pageview".
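A sketch of the load command using the Azure Synapse connector; the tempDir container and storage account are placeholders, and jdbcUrl is the string built in Step 6. The connector stages the DataFrame under tempDir and lets Synapse load it from there.

# Write the DataFrame to the Synapse table "pageview"
(transformedDF.write
    .format("com.databricks.spark.sqldw")
    .option("url", jdbcUrl)
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "pageview")
    .option("tempDir", "wasbs://staging@mystorageacct.blob.core.windows.net/pageview")
    .mode("overwrite")
    .save())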
Step 8: Let's check in Synapse by querying the table "pageview".
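For example, in the Synapse SQL editor a quick sanity check could look like this (the dbo schema is an assumption):

-- Preview a few rows and confirm the row count
SELECT TOP 10 * FROM dbo.pageview;
SELECT COUNT(*) AS row_count FROM dbo.pageview;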
So you have seen how quickly we moved our data from AWS S3 to an Azure Synapse Analytics table, in just a few steps.
Synapse Analytics is a very powerful tool. The dedicated SQL pool is only one of the options it offers; it also integrates with Data Factory, Power BI, and Azure Purview.
Also, don't forget it comes with both a SQL engine and a Spark engine. Maybe I will share more insights about it in the future.
This marks the end of this article. I hope I was able to give you something new to learn. Thanks for reading; please share your feedback in the comments, and like and share the article if you enjoyed the content.
Thanks !! Happy Weekend, Happy Learning !!
Data Engineering Manager
Thanks for sharing, Deepak. Great detailed walkthrough. On the topic of choice of service: I noticed that the file you were extracting was 3.6 KB. That size would be a good fit for Azure Functions for speed and cost savings; depending on the plan, you could even use the Consumption plan, so it would essentially be free. I'd recommend that as well, since you only have a small transform. So many factors go into choosing which service to use. Anyway, good article.
Data Engineering @ Microsoft
Thanks. Instead of reading from a mount point, I want to read directly from S3 and write to Synapse, i.e. use an AWS access key and secret key for reading, and an Azure storage account key for authentication while writing. Is that possible?