How to Simplify and Scale ETL with Azure Data Factory and Azure Databricks

Data lakes give organizations timely, secure access to a wide range of data sources, enabling them to deliver value and insight continuously. The first step is robust data pipeline orchestration and automation. As the volume, variety, and velocity of data grow, so does the need for extract, transform, and load (ETL) pipelines that are both reliable and secure.

Azure Databricks is the fastest-growing Data & AI service on Microsoft Azure today, processing over two exabytes (2 billion gigabytes) of data each month. Tight integration between Azure Databricks and other Azure services makes it possible to simplify and scale customer data ingestion pipelines. For instance, integration with Azure Active Directory (Azure AD) allows for consistent cloud-based management of identity and access, while integration with Azure Data Lake Storage (ADLS) and Azure Data Factory (ADF) enables hybrid data integration for ETL at scale on top of highly scalable, secure storage for big data analytics.

Connect, ingest, and transform data with a single workflow

ADF offers more than 90 built-in data source connectors, so all of your data sources can be ingested into a single data lake. ADF also provides built-in workflow control, pipeline scheduling, and data transformation and integration capabilities. With Azure Databricks and Delta Lake, raw data can be transformed into Bronze, Silver, and Gold tables for downstream consumers. A common pattern is to use ADF with Azure Databricks and Delta Lake to enable SQL queries on data lakes and to build data pipelines for machine learning.
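
To make the Bronze/Silver/Gold pattern concrete, here is a minimal PySpark sketch of the kind of Delta Lake transformations a Databricks notebook might run. The storage paths, column names, and table layout are all illustrative, and the code assumes it runs inside a Databricks notebook where `spark` is predefined.

```python
from pyspark.sql import functions as F

# Bronze: land raw JSON from the data lake as-is (path is illustrative).
bronze = spark.read.json("abfss://raw@mydatalake.dfs.core.windows.net/events/")
bronze.write.format("delta").mode("append").save("/delta/bronze/events")

# Silver: de-duplicate and clean the Bronze data.
silver = (spark.read.format("delta").load("/delta/bronze/events")
          .dropDuplicates(["event_id"])
          .withColumn("event_date", F.to_date("event_timestamp")))
silver.write.format("delta").mode("overwrite").save("/delta/silver/events")

# Gold: business-level aggregates ready for SQL analytics and ML features.
gold = (silver.groupBy("event_date")
        .agg(F.count("*").alias("event_count")))
gold.write.format("delta").mode("overwrite").save("/delta/gold/daily_event_counts")
```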

Azure Data Factory with Azure Databricks: Getting Started Guide

To run an Azure Databricks notebook from Azure Data Factory, go to the Azure portal, search for "Data factories", and click "Create".

Give your data factory a unique name, select a subscription, and choose a resource group and region. Then click "Create".
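
If you prefer to script this step, the same data factory can be created with the azure-mgmt-datafactory Python SDK. The sketch below is illustrative: the subscription ID, resource group, factory name, and region are placeholders, and the resource group is assumed to already exist.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Placeholder values -- substitute your own subscription, resource group, and names.
subscription_id = "<subscription-id>"
resource_group = "my-resource-group"
factory_name = "my-data-factory"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the data factory in the chosen region.
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus"))
print(factory.name, factory.provisioning_state)
```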

Once the data factory has been created, click "Go to resource" to open it.

Click the "Author & Monitor" tile to open the Data Factory user interface.

Go back to the "Let's get started" page of Azure Data Factory and click the "Author" button on the left side.

Click on "Connections" at the bottom of your screen, then "New," to begin.

Select "Azure Databricks" from the "Compute" tab in the "New linked service" pane, then click "Continue".

Specify a name for the linked service and select the Azure Databricks workspace it should connect to.

Select "User settings" on your Azure Databricks workspace, then click "Create an access token" from the drop-down menu.

Select "Generate New Token" from the drop-down menu.

Back in ADF, choose the cluster version and Python version you want the linked service to use. Review the settings once more, then click "Create".
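
The same linked service can also be defined in code. The sketch below uses the azure-mgmt-datafactory SDK to register an Azure Databricks linked service that spins up a new job cluster per run; the workspace URL, access token, runtime version, node type, and resource names are all placeholder values.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureDatabricksLinkedService, SecureString)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# New-cluster settings mirror the choices made in the ADF UI (all values illustrative).
databricks_ls = AzureDatabricksLinkedService(
    domain="https://adb-1234567890123456.7.azuredatabricks.net",
    access_token=SecureString(value="<databricks-access-token>"),
    new_cluster_version="10.4.x-scala2.12",
    new_cluster_node_type="Standard_DS3_v2",
    new_cluster_num_of_worker="2")

adf_client.linked_services.create_or_update(
    "my-resource-group", "my-data-factory", "AzureDatabricksLinkedService",
    LinkedServiceResource(properties=databricks_ls))
```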

With the linked service in place, you can now build a pipeline. In the Azure Data Factory UI, click the plus (+) icon and select "Pipeline".

By clicking on the "Parameters" tab and then the addition (+) button, you can add a new parameter.

Expand the "Databricks" activity and then drag and drop a Databricks notebook onto the pipeline design canvas to add it to the pipeline.

Select the "Azure Databricks" option and select the associated service created above to connect to the Azure Databricks workspace. Next, go to the "Settings" tab and enter the notebook's location. To publish to the ADF service, click "Validate" and then "Publish All."

Once published, click "Add Trigger | Trigger Now" to start a pipeline run.
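
Pipeline runs can also be triggered from code instead of the UI, for example with the azure-mgmt-datafactory SDK (the resource names below are the same placeholders used in the earlier sketches):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Start a run and pass a value for the 'name' pipeline parameter.
run = adf_client.pipelines.create_run(
    "my-resource-group", "my-data-factory", "DatabricksNotebookPipeline",
    parameters={"name": "hello-from-adf"})
print("Run ID:", run.run_id)
```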

Review the parameter values, then click "Finish" to start the run.

To monitor the pipeline's progress, switch to the "Monitor" tab in the left-hand panel.
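
The same status information is available programmatically. A small sketch that polls the run started above (placeholder names again; the run ID comes from the previous step):

```python
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Poll until the run leaves the queued / in-progress states.
run_id = "<run-id-from-create_run>"
while True:
    run = adf_client.pipeline_runs.get("my-resource-group", "my-data-factory", run_id)
    if run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)
print("Pipeline run finished with status:", run.status)
```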

Custom ETL code can be parameterized and operationalized with ease using Azure Databricks notebooks integrated into your Azure Data Factory pipelines. See this ADF blog article and this ADF tutorial for more information on how Azure Databricks connects with ADF. See this webinar, Using SQL to Query Your Data Lake with Delta Lake, to learn more about exploring and querying your data lake.
