Azure Data Factory

Azure Data Factory (ADF) is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. ADF does not store any data itself; instead, it orchestrates the movement of data between supported data stores and then processes the data using compute services in other regions or in an on-premises environment. It also lets you monitor and manage workflows through both programmatic and UI mechanisms. As Azure's cloud ETL service for scale-out serverless data integration and data transformation, ADF offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management. You can also lift and shift existing SSIS packages to Azure and run them with full compatibility in ADF.

Azure Data Factory use case

ADF can be used for:

  • Supporting data migrations
  • Getting data from a client’s server or online data to an Azure Data Lake
  • Carrying out various data integration processes
  • Integrating data from different ERP systems and loading it into Azure Synapse for reporting

How does Azure Data Factory work?

The Data Factory service allows you to create data pipelines that move and transform data and then run those pipelines on a specified schedule (hourly, daily, weekly, and so on). The data consumed and produced by these workflows is therefore time-sliced, and a pipeline can be run on a recurring schedule (for example, once a day) or as a one-time execution.
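
To make the scheduling concrete, here is a minimal sketch of attaching a daily schedule trigger to an existing pipeline with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, and pipeline names are placeholders, and exact model signatures can vary slightly between SDK versions.

    from datetime import datetime, timedelta
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
        TriggerPipelineReference, PipelineReference,
    )

    # Placeholder names -- replace with your own subscription, resource group,
    # data factory, and pipeline.
    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, df, pipeline_name = "my-rg", "my-data-factory", "CopyPipeline"

    # Run the referenced pipeline once a day, starting tomorrow (UTC).
    recurrence = ScheduleTriggerRecurrence(
        frequency="Day", interval=1,
        start_time=datetime.utcnow() + timedelta(days=1), time_zone="UTC")
    trigger = ScheduleTrigger(
        recurrence=recurrence,
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                type="PipelineReference", reference_name=pipeline_name),
            parameters={})])

    adf_client.triggers.create_or_update(rg, df, "DailyTrigger", TriggerResource(properties=trigger))
    # Triggers are created in a stopped state; starting one activates the schedule
    # (older SDK versions expose .start() instead of .begin_start()).
    adf_client.triggers.begin_start(rg, df, "DailyTrigger").result()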

Azure Data Factory pipelines (data-driven workflows) typically perform three steps.

Step 1: Connect and Collect

Connect to all the required sources of data and processing, such as SaaS services, file shares, FTP, and web services. Then use the Copy Activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for subsequent processing and analysis.
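
To make the Copy Activity concrete, the following is a minimal sketch of a pipeline containing a single Copy Activity that moves data between two Azure Blob storage folders, using the azure-mgmt-datafactory Python SDK. The dataset names are placeholders and are assumed to exist already in the factory; signatures may vary between SDK versions.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, df = "my-rg", "my-data-factory"

    # "RawBlobDataset" and "StagedBlobDataset" are placeholder datasets assumed
    # to have been defined in the factory beforehand.
    copy_activity = CopyActivity(
        name="CopyRawToStaging",
        inputs=[DatasetReference(type="DatasetReference", reference_name="RawBlobDataset")],
        outputs=[DatasetReference(type="DatasetReference", reference_name="StagedBlobDataset")],
        source=BlobSource(),  # read from the source blob folder
        sink=BlobSink())      # write to the centralized/staging blob folder

    adf_client.pipelines.create_or_update(
        rg, df, "CopyPipeline", PipelineResource(activities=[copy_activity]))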

Step 2: Transform and Enrich

Once data is present in a centralized data store in the cloud, it is transformed using compute services such as HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Machine Learning.
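
As one illustration of this step, the sketch below adds an Azure Databricks notebook activity to a pipeline; Databricks is just one of the compute services ADF can orchestrate, and HDInsight Hive, Spark, and Machine Learning activities follow the same pattern. The linked service name and notebook path are placeholders, and signatures may differ between SDK versions.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineResource, DatabricksNotebookActivity, LinkedServiceReference,
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, df = "my-rg", "my-data-factory"

    # "AzureDatabricksLinkedService" and the notebook path are placeholders.
    transform_activity = DatabricksNotebookActivity(
        name="TransformStagedData",
        notebook_path="/Shared/transform_staged_data",
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureDatabricksLinkedService"),
        base_parameters={"input_folder": "staging", "output_folder": "curated"})

    adf_client.pipelines.create_or_update(
        rg, df, "TransformPipeline", PipelineResource(activities=[transform_activity]))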

Step 3: Publish

Deliver transformed data from the cloud to on-premises sources like SQL Server or keep it in your cloud storage sources for consumption by BI and analytics tools and other applications.
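
Publishing to a relational store is itself just another Copy Activity with a SQL sink. The sketch below assumes a curated blob dataset and an Azure SQL Database dataset already exist under the placeholder names shown; depending on the SDK version the sink class may be SqlSink rather than AzureSqlSink.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineResource, CopyActivity, DatasetReference, BlobSource, AzureSqlSink,
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, df = "my-rg", "my-data-factory"

    # "CuratedBlobDataset" and "CuratedSqlTable" are placeholder dataset names.
    publish_activity = CopyActivity(
        name="PublishToAzureSql",
        inputs=[DatasetReference(type="DatasetReference", reference_name="CuratedBlobDataset")],
        outputs=[DatasetReference(type="DatasetReference", reference_name="CuratedSqlTable")],
        source=BlobSource(),
        sink=AzureSqlSink())

    adf_client.pipelines.create_or_update(
        rg, df, "PublishPipeline", PipelineResource(activities=[publish_activity]))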

Data migration activities with Azure Data Factory

By using Azure Data Factory, data migration can take place between two cloud data stores or between an on-premises data store and a cloud data store.

Copy Activity in Azure Data Factory copies data from a source data store to a sink data store. Azure supports a wide range of source and sink data stores, such as Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Oracle, and Cassandra. For the full list of data stores supported for data movement activities, refer to the Azure documentation. Azure Data Factory also supports transformation activities such as Hive, MapReduce, and Spark, which can be added to pipelines either individually or chained with other activities, as sketched below.
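
A minimal sketch of such chaining, again with the Python SDK and placeholder names: the transformation activity below runs only after the copy activity succeeds, which is expressed through an activity dependency.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineResource, CopyActivity, DatabricksNotebookActivity,
        ActivityDependency, DatasetReference, LinkedServiceReference,
        BlobSource, BlobSink,
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, df = "my-rg", "my-data-factory"

    copy_raw = CopyActivity(
        name="CopyRaw",
        inputs=[DatasetReference(type="DatasetReference", reference_name="RawBlobDataset")],
        outputs=[DatasetReference(type="DatasetReference", reference_name="StagedBlobDataset")],
        source=BlobSource(), sink=BlobSink())

    # Chained activity: runs only if CopyRaw finishes with status "Succeeded".
    transform = DatabricksNotebookActivity(
        name="TransformStaged",
        notebook_path="/Shared/transform_staged_data",
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureDatabricksLinkedService"),
        depends_on=[ActivityDependency(activity="CopyRaw", dependency_conditions=["Succeeded"])])

    adf_client.pipelines.create_or_update(
        rg, df, "CopyAndTransformPipeline", PipelineResource(activities=[copy_raw, transform]))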

Key components of Azure Data Factory Pipeline

  • Data Sources: Azure Data Factory Pipeline supports a variety of data sources, including on-premises, cloud-based, and big data sources. These sources can be connected to the pipeline to extract data for processing (a sketch of wiring up sources and destinations programmatically follows this list).
  • Data Destinations: Once the data is extracted, it needs to be loaded into a destination system. Azure Data Factory Pipeline supports a variety of data destinations, such as Azure SQL Database, Azure Blob Storage, and Azure Data Lake Storage.
  • Activities: Activities are the building blocks of an Azure Data Factory Pipeline. They are used to define the steps that need to be performed on the data, such as copying data, executing a stored procedure, or transforming data.
  • Data Flows: Data flows are used to transform data within the pipeline. They provide a visual interface for defining data transformations, such as mapping, filtering, and aggregation.
  • Triggers: Triggers are used to schedule the execution of a pipeline. They allow you to define the frequency and start time of the pipeline execution.
  • Integration Runtimes: Integration runtimes are used to connect to the data sources and destinations. They provide a secure and efficient way to move data between different systems.
  • Monitoring: Azure Data Factory Pipeline provides real-time monitoring capabilities that allow you to track the progress of your pipeline execution.
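
To show how the data source, data destination, and integration runtime components fit together, here is a hedged sketch that registers an Azure Storage linked service (optionally bound to a self-hosted integration runtime) and a blob dataset on top of it. The names, connection string, and integration runtime are placeholders, and exact model signatures vary by SDK version.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        LinkedServiceResource, AzureStorageLinkedService, SecureString,
        IntegrationRuntimeReference, DatasetResource, AzureBlobDataset,
        LinkedServiceReference,
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, df = "my-rg", "my-data-factory"

    # Connection (linked service): how ADF reaches the store. connect_via is
    # optional and points at a previously created self-hosted integration
    # runtime; omit it to use the default Azure integration runtime.
    storage_ls = LinkedServiceResource(properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"),
        connect_via=IntegrationRuntimeReference(
            type="IntegrationRuntimeReference", reference_name="SelfHostedIR")))
    adf_client.linked_services.create_or_update(rg, df, "StorageLinkedService", storage_ls)

    # Dataset: the location and shape of the data used as a source or destination.
    blob_ds = DatasetResource(properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="StorageLinkedService"),
        folder_path="raw/input", file_name="data.csv"))
    adf_client.datasets.create_or_update(rg, df, "RawBlobDataset", blob_ds)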

How to Create a Pipeline in Azure Data Factory?

Creating a pipeline in Azure Data Factory involves a few simple steps. By following these steps, you can build and execute data integration workflows to move and transform data between various sources and destinations. A programmatic equivalent using the Azure SDK for Python is sketched after the steps.

  • Create a new data factory: To create a new data factory, go to the Azure portal and search for “Data Factory”. Click on “Add” and follow the prompts to create a new data factory.
  • Create a new pipeline: Once you have created a new data factory, click on “Author & Monitor” to open the Azure Data Factory UI. Click on the “Author” tab and then select “New pipeline” to create a new pipeline.
  • Add an activity: Once you have created a new pipeline, you can add activities to it. To add an activity, drag and drop it from the toolbox onto the pipeline canvas.
  • Configure the activity: Once you have added an activity to the pipeline, you need to configure it. This involves defining the input and output datasets, as well as any other settings required for the activity.
  • Add more activities: You can add as many activities as you need to the pipeline. You can also use control flow activities, such as conditional statements and loops, to create more complex workflows.
  • Publish the pipeline: Once you have finished building the pipeline, click on “Publish all” to save and publish the changes.
  • Trigger the pipeline: To execute the pipeline, you need to trigger it. You can do this manually by clicking on “Trigger now” or you can set up a trigger to run the pipeline on a schedule or in response to an event.
  • Monitor the pipeline: Once the pipeline is running, you can monitor its progress by clicking on “Monitor & Manage”. This will show you the status of the pipeline and any errors or warnings that occur.
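
For readers who prefer code over the portal, the following is a minimal, hedged end-to-end sketch of the same steps with the azure-mgmt-datafactory Python SDK: create a factory, author a pipeline (a Wait activity stands in for real work), trigger a run, and poll its status. The names and region are placeholders; the UI "Publish all" step corresponds to the create_or_update calls here.

    import time
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import Factory, PipelineResource, WaitActivity

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, df = "my-rg", "my-data-factory"

    # 1. Create (or update) the data factory itself.
    adf_client.factories.create_or_update(rg, df, Factory(location="westeurope"))

    # 2. Create a pipeline with a single activity. A Wait activity keeps the
    #    sketch self-contained; a real pipeline would use copy, data flow, or
    #    transformation activities as shown earlier.
    pipeline = PipelineResource(activities=[WaitActivity(name="WaitTenSeconds", wait_time_in_seconds=10)])
    adf_client.pipelines.create_or_update(rg, df, "DemoPipeline", pipeline)

    # 3. Trigger the pipeline manually (the "Trigger now" equivalent).
    run = adf_client.pipelines.create_run(rg, df, "DemoPipeline", parameters={})

    # 4. Monitor the run until it reaches a terminal state.
    status = adf_client.pipeline_runs.get(rg, df, run.run_id).status
    while status in ("Queued", "InProgress"):
        time.sleep(10)
        status = adf_client.pipeline_runs.get(rg, df, run.run_id).status
    print("Pipeline run finished with status:", status)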

Advantages of Azure Data Factory Pipeline

  • Scalability: Azure Data Factory Pipeline can easily scale up or down based on your data integration needs. You can add or remove resources as required and pay only for the resources you use.
  • Integration: Azure Data Factory enables you to integrate with various data sources, including on-premises and cloud-based systems. This integration is possible through connectors that support a variety of data stores, including Azure Blob Storage and Azure SQL Database.
  • Automation: Azure Data Factory Pipeline enables you to automate your data integration workflows, reducing manual intervention and allowing you to focus on other critical tasks.
  • Flexibility: Azure Data Factory Pipeline supports both code-based and GUI-based workflows, giving you the flexibility to choose the approach that works best for your needs.
  • Security: Azure Data Factory Pipeline implements robust security measures to protect your data, including encryption at rest and in transit, role-based access control (RBAC), and more.
  • Monitoring: Azure Data Factory Pipeline provides comprehensive monitoring and logging capabilities, allowing you to monitor pipeline executions, diagnose issues, and optimize performance.
  • Cost-Effective: Azure Data Factory Pipeline is a cost-effective data integration solution, as it supports pay-as-you-go pricing, which means you only pay for the resources you use.

Disadvantages of Azure Data Factory Pipeline

  • Complex or highly custom transformations usually require external compute services such as Azure Databricks, HDInsight, or Azure Functions; the built-in activities focus on orchestration and mapping data flows.
  • Mapping data flows execute on managed Spark clusters, which adds cluster start-up latency and extra cost for small or frequent jobs.
  • Pipelines are batch- and schedule-oriented; Azure Data Factory is not designed for low-latency or real-time streaming scenarios.
  • Pricing is based on activity runs, data movement, and data flow execution time, which can make costs hard to predict for high-frequency pipelines.
  • Debugging failed runs and interpreting error messages can be difficult, especially inside data flows and nested pipelines.
  • CI/CD of pipelines through Git integration and ARM templates adds complexity for multi-environment deployments.
  • The service is tightly coupled to the Azure ecosystem, so pipelines are not easily portable to other clouds or on-premises orchestrators.
