How to Simply Scale ETL with Azure Data Factory and Azure Databricks
Data lakes give organizations timely, secure access to a wide range of data sources, allowing them to deliver value and insight on a continuous basis. The first step is robust data pipeline orchestration and automation. As the volume, variety, and velocity of data grow, so does the demand for extract, transform, and load (ETL) pipelines that are both reliable and secure.
Azure Databricks is the fastest-growing Data & AI service on Microsoft Azure today, processing over two exabytes (2 billion gigabytes) of data each month. Its tight integration with other Azure services makes it possible to simplify and scale customer data ingestion pipelines. For instance, integration with Azure Active Directory (Azure AD) provides consistent, cloud-based identity and access management. Integration with Azure Data Lake Storage (ADLS) provides highly scalable, secure storage for big data analytics, and integration with Azure Data Factory (ADF) enables hybrid data integration to facilitate ETL at scale.
Connect, ingest, and transform data with a single workflow
With 90+ built-in data source connectors, ADF can ingest all of your data sources into a single data lake. ADF also provides built-in workflow control, pipeline scheduling, and data transformation and integration capabilities. From there, Azure Databricks and Delta Lake can transform the raw data into curated Bronze, Silver, and Gold tables. Common uses of ADF with Azure Databricks and Delta Lake include building data pipelines for machine learning and enabling SQL queries directly on the data lake.
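For a sense of what that refinement looks like, here is a minimal PySpark sketch of a Bronze-to-Silver step, intended to run inside a Databricks notebook where `spark` is already defined. The ADLS container, paths, and column names are hypothetical placeholders, not values from this walkthrough:

```python
from pyspark.sql import functions as F

# Hypothetical ADLS Gen2 locations for the raw landing zone and the Delta tables.
raw_path    = "abfss://raw@mydatalake.dfs.core.windows.net/orders/"
bronze_path = "abfss://lake@mydatalake.dfs.core.windows.net/bronze/orders"
silver_path = "abfss://lake@mydatalake.dfs.core.windows.net/silver/orders"

# Bronze: ingest the raw JSON files as-is into a Delta table.
raw_df = spark.read.json(raw_path)
raw_df.write.format("delta").mode("append").save(bronze_path)

# Silver: clean and de-duplicate the Bronze data before it feeds Gold tables.
silver_df = (
    spark.read.format("delta").load(bronze_path)
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_ts"))
    .filter(F.col("order_id").isNotNull())
)
silver_df.write.format("delta").mode("overwrite").save(silver_path)
```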
Azure Data Factory with Azure Databricks: Getting Started Guide
To run an Azure Databricks notebook from Azure Data Factory, go to the Azure portal, search for “Data factories”, and click “Create”.
Give your data factory a unique name, pick a subscription, and choose a resource group and region, then click "Create".
Once the data factory has been created, click the "Go to resource" button to open it.
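If you'd rather script this step than click through the portal, a minimal sketch using the azure-mgmt-datafactory Python SDK might look like the following. The subscription, tenant, service principal, resource group, and factory names are placeholders, and the resource group is assumed to exist already:

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Placeholder identifiers -- replace with your own values.
subscription_id = "<subscription-id>"
rg_name = "my-resource-group"
df_name = "my-data-factory"

# Authenticate with a service principal (client ID, secret, and tenant are placeholders).
credential = ClientSecretCredential(
    client_id="<client-id>",
    client_secret="<client-secret>",
    tenant_id="<tenant-id>",
)
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Create (or update) the data factory in the chosen region.
df_resource = Factory(location="eastus")
df = adf_client.factories.create_or_update(rg_name, df_name, df_resource)
print(df.name, df.provisioning_state)
```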
Open the Data Factory user interface by clicking the "Author & Monitor" tile.
Go back to the "Let's get started" page of Azure Data Factory and click the "Author" button on the left side.
To begin, click "Connections" at the bottom of the screen, then click "New".
Select "Azure Databricks" from the "Compute" tab in the "New linked service" pane, then click "Continue".
Specify a name for the linked service and select the Azure Databricks workspace it should connect to.
In your Azure Databricks workspace, open "User Settings" and go to the access tokens section.
Click "Generate New Token" and copy the token value; you will paste it into the ADF linked service.
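As a quick sanity check, you can verify the new token against the Databricks REST API before pasting it into the linked service. A minimal sketch using the requests library, with a placeholder workspace URL:

```python
import requests

# Placeholder workspace URL and the personal access token generated above.
workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "<personal-access-token>"

# List the clusters in the workspace; a 200 response confirms the token works.
resp = requests.get(
    f"{workspace_url}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
print(resp.json())
```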
Back in the "New linked service" pane, paste the access token and choose the cluster version and Python version you want to use. Review everything one more time, then click "Create".
With the linked service in place, you can now build a pipeline. In the Azure Data Factory UI, click the plus (+) icon and select "Pipeline".
On the "Parameters" tab, click the plus (+) button to add a new pipeline parameter.
Expand the "Databricks" activity category and drag a Databricks Notebook activity onto the pipeline design canvas to add it to the pipeline.
On the "Azure Databricks" tab, select the linked service created above to connect to the Azure Databricks workspace. Then, on the "Settings" tab, enter the notebook path. Click "Validate", then "Publish All" to publish to the ADF service.
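On the notebook side, the pipeline parameter added earlier can be passed through the activity's base parameters and read with Databricks widgets, and the notebook can hand a result back to ADF when it finishes. Here is a minimal sketch; the parameter name and return payload are hypothetical, and the returned value surfaces in the activity output as runOutput:

```python
import json

# Read the value that ADF passes in via the activity's base parameters
# (the parameter name "input_path" is hypothetical).
dbutils.widgets.text("input_path", "")
input_path = dbutils.widgets.get("input_path")

# ... run the ETL logic against input_path ...

# Return a small JSON payload to ADF; it appears in the activity output
# as runOutput and can be referenced by downstream activities.
dbutils.notebook.exit(json.dumps({"status": "ok", "rows_processed": 42}))
```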
Once published, click "Add Trigger | Trigger Now" to begin a pipeline run.
Review the parameters, then click "Finish" to start the pipeline run.
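The same run can also be started programmatically. Continuing with the adf_client, rg_name, and df_name from the earlier sketch, and using placeholder pipeline and parameter names:

```python
# Start a run of the published pipeline, supplying a value for the
# pipeline parameter (names and paths are placeholders).
run_response = adf_client.pipelines.create_run(
    rg_name,
    df_name,
    "my-databricks-pipeline",
    parameters={"input_path": "abfss://raw@mydatalake.dfs.core.windows.net/orders/"},
)
print("Started run:", run_response.run_id)
```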
To monitor the pipeline run's progress, switch to the "Monitor" tab in the left-hand panel.
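The run's status can be polled from the same SDK as well. A minimal sketch continuing from the run started above:

```python
import time

# Poll the run until it leaves the in-progress states.
while True:
    pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run_response.run_id)
    print("Pipeline run status:", pipeline_run.status)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(30)
```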
Custom ETL code can be parameterized and operationalized with ease using Azure Databricks notebooks integrated into your Azure Data Factory pipelines. See this ADF blog article and this ADF tutorial for more information on how Azure Databricks connects with ADF. See this webinar, Using SQL to Query Your Data Lake with Delta Lake, to learn more about exploring and querying your data lake.