Azure Data Factory
Data Compression: During the Copy activity, you can compress the data and write the compressed data to the target data source. This helps optimize bandwidth usage when copying data.
Extensive Connectivity Support for Different Data Sources: Azure Data Factory provides broad connectivity support for a wide range of data sources, which is useful when you want to pull data from or write data to different data stores.
Custom Event Triggers: Azure Data Factory lets you automate data processing using custom event triggers, so a given action executes automatically when a specified event occurs (see the sketch after this list).
Data Preview and Validation: During the Copy activity, tools are provided for previewing and validating data, helping you ensure that data is copied and written to the target data source correctly.
Customizable Data Flows: Azure Data Factory lets you create customizable data flows, so you can add custom actions or steps for data processing.
Integrated Security: Azure Data Factory offers integrated security features such as Entra ID integration and role-based access control to govern access to data flows. These features increase security in data processing and protect your data.
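For illustration, here is a minimal sketch of a custom event trigger created with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, pipeline, Event Grid topic, and event names are all placeholders, and exact model and parameter names can vary between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CustomEventsTrigger,
    PipelineReference,
    TriggerPipelineReference,
    TriggerResource,
)

# Placeholder identifiers -- substitute your own subscription, resource group,
# factory, pipeline, and Event Grid custom topic.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "my-rg"
FACTORY_NAME = "my-data-factory"
TOPIC_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/my-rg"
    "/providers/Microsoft.EventGrid/topics/my-topic"
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Run the (placeholder) pipeline whenever a matching custom event arrives on the topic.
trigger = CustomEventsTrigger(
    scope=TOPIC_ID,
    events=["Contoso.GameLogs.Uploaded"],  # hypothetical event type to match
    subject_begins_with="logs/",           # optional subject filter
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                reference_name="ProcessGameLogs", type="PipelineReference"
            ),
            parameters={},
        )
    ],
)

adf_client.triggers.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "GameLogsEventTrigger", TriggerResource(properties=trigger)
)
```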
Usage scenarios
For example, imagine a gaming company that collects petabytes of game logs that are produced by games in the cloud. The company wants to analyze these logs to gain insights into customer preferences, demographics, and usage behavior. It also wants to identify up-sell and cross-sell opportunities, develop compelling new features, drive business growth, and provide a better experience to its customers.
To analyze these logs, the company needs to use reference data such as customer information, game information, and marketing campaign information that is in an on-premises data store. The company wants to utilize this data from the on-premises data store, combining it with additional log data that it has in a cloud data store.
To extract insights, the company hopes to process the joined data by using a Spark cluster in the cloud (Azure HDInsight) and publish the transformed data into a cloud data warehouse such as Azure Synapse Analytics, so that it can easily build a report on top of it. The company wants to automate this workflow, and monitor and manage it on a daily schedule. It also wants to execute the workflow when files land in a blob store container.
Azure Data Factory is the platform that solves such data scenarios. It is the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.
Additionally, you can publish your transformed data to data stores such as Azure Synapse Analytics for business intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized into meaningful data stores and data lakes for better business decisions.
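As an illustration of scheduling such a pipeline on a daily cadence, here is a minimal sketch using the azure-mgmt-datafactory Python SDK. All names are placeholders, and depending on your SDK version the start call may be exposed as triggers.start rather than triggers.begin_start.

```python
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    TriggerResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "my-rg", "my-data-factory"

# Run the (placeholder) pipeline once a day, starting tomorrow.
recurrence = ScheduleTriggerRecurrence(
    frequency="Day",
    interval=1,
    start_time=datetime.utcnow() + timedelta(days=1),
    time_zone="UTC",
)
trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                reference_name="DailyGameLogPipeline", type="PipelineReference"
            ),
            parameters={},
        )
    ],
)

adf_client.triggers.create_or_update(rg, factory, "DailyTrigger", TriggerResource(properties=trigger))
adf_client.triggers.begin_start(rg, factory, "DailyTrigger").result()  # activate the trigger
```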
How does it work?
Data Factory contains a series of interconnected systems that provide a complete end-to-end platform for data engineers.
A visual guide (available as a high-resolution image in the original documentation) provides a detailed overview of the complete Data Factory architecture.
Connect and collect
Enterprises have data of various types (structured, unstructured, and semi-structured) located in disparate sources on-premises and in the cloud, all arriving at different intervals and speeds.
The first step in building an information production system is to connect to all the required sources of data and processing, such as software-as-a-service (SaaS) services, databases, file shares, and FTP web services. The next step is to move the data as needed to a centralized location for subsequent processing.
Without Data Factory, enterprises must build custom data movement components or write custom services to integrate these data sources and processing. Such systems are expensive and hard to integrate and maintain, and they often lack the enterprise-grade monitoring, alerting, and controls that a fully managed service can offer.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis. For example, you can collect data in Azure Data Lake Storage and transform the data later by using an Azure Data Lake Analytics compute service. You can also collect data in Azure Blob storage and transform it later by using an Azure HDInsight Hadoop cluster.
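A condensed sketch of that pattern, following the general shape of the Azure Data Factory Python SDK quickstart, is shown below. The storage connection string, container paths, and resource names are placeholders, and it assumes the data factory itself already exists.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset,
    AzureStorageLinkedService,
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    DatasetResource,
    LinkedServiceReference,
    LinkedServiceResource,
    PipelineResource,
    SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "my-rg", "my-data-factory"

# 1. Linked service: how the factory connects to the storage account (placeholder connection string).
storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)
adf_client.linked_services.create_or_update(rg, factory, "GameLogStorage", storage_ls)

ls_ref = LinkedServiceReference(reference_name="GameLogStorage", type="LinkedServiceReference")

# 2. Datasets: the source blob folder and the centralized destination folder.
for name, path in [("RawGameLogs", "gamelogs/raw"), ("CentralGameLogs", "gamelogs/curated")]:
    adf_client.datasets.create_or_update(
        rg, factory, name,
        DatasetResource(properties=AzureBlobDataset(linked_service_name=ls_ref, folder_path=path)),
    )

# 3. Pipeline with a single Copy Activity moving data between the two datasets.
copy = CopyActivity(
    name="CopyRawToCentral",
    inputs=[DatasetReference(reference_name="RawGameLogs", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="CentralGameLogs", type="DatasetReference")],
    source=BlobSource(),
    sink=BlobSink(),
)
adf_client.pipelines.create_or_update(rg, factory, "CollectGameLogs", PipelineResource(activities=[copy]))

# 4. Trigger a one-off run.
run = adf_client.pipelines.create_run(rg, factory, "CollectGameLogs", parameters={})
print("Started pipeline run:", run.run_id)
```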
Transform and enrich
After data is present in a centralized data store in the cloud, process or transform the collected data by using ADF mapping data flows. Data flows enable data engineers to build and maintain data transformation graphs that execute on Spark without needing to understand Spark clusters or Spark programming.
If you prefer to code transformations by hand, ADF supports external activities for executing your transformations on compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning.
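As a hedged sketch of such an external activity, the snippet below runs a notebook on Azure Databricks (one of the compute services mentioned earlier) from a pipeline. It assumes an existing Azure Databricks linked service; the linked service name, notebook path, and parameters are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatabricksNotebookActivity,
    LinkedServiceReference,
    PipelineResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "my-rg", "my-data-factory"

# Hand-coded transformation: run an existing Databricks notebook on a cluster
# defined by an Azure Databricks linked service (assumed to already exist).
transform = DatabricksNotebookActivity(
    name="TransformGameLogs",
    linked_service_name=LinkedServiceReference(
        reference_name="AzureDatabricksCompute", type="LinkedServiceReference"
    ),
    notebook_path="/Shared/transform_game_logs",       # notebook in the Databricks workspace
    base_parameters={"input_path": "gamelogs/curated"},  # passed to the notebook as widgets
)

adf_client.pipelines.create_or_update(
    rg, factory, "TransformGameLogsPipeline", PipelineResource(activities=[transform])
)
```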
CI/CD and publish
Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to incrementally develop and deliver your ETL processes before publishing the finished product. After the raw data has been refined into a business-ready consumable form, load the data into Azure Synapse Analytics, Azure SQL Database, Azure Cosmos DB, or whichever analytics engine your business users can point to from their business intelligence tools.
Monitor
After you have successfully built and deployed your data integration pipeline, providing business value from refined data, monitor the scheduled activities and pipelines for success and failure rates. Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.
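For example, here is a minimal monitoring sketch using the azure-mgmt-datafactory Python SDK: it checks the status of a pipeline run and lists its activity runs. The run ID and resource names are placeholders.

```python
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "my-rg", "my-data-factory"

# Check the status of a specific pipeline run (run_id is returned by pipelines.create_run).
run_id = "<pipeline-run-id>"
pipeline_run = adf_client.pipeline_runs.get(rg, factory, run_id)
print("Pipeline run status:", pipeline_run.status)

# List the individual activity runs from the last day for that pipeline run.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow() + timedelta(days=1),
)
for activity_run in adf_client.activity_runs.query_by_pipeline_run(rg, factory, run_id, filters).value:
    print(activity_run.activity_name, activity_run.status)
```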
Top-level concepts
An Azure subscription might have one or more Azure Data Factory instances (or data factories). Azure Data Factory is composed of key components such as pipelines, activities, datasets, linked services, data flows, and integration runtimes.
These components work together to provide the platform on which you can compose data-driven workflows with steps to move and transform data.
Pipeline
A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of activities that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to partition the data.
The benefit is that a pipeline lets you manage the activities as a set instead of managing each one individually. The activities in a pipeline can be chained together to operate sequentially, or they can operate independently in parallel.
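To make that concrete, here is a hedged sketch of the blob-ingest-then-Hive example as a single pipeline with two chained activities, again using the azure-mgmt-datafactory Python SDK. It assumes the referenced datasets and linked services (HDInsight cluster and storage) already exist; all names and paths are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    HDInsightHiveActivity,
    LinkedServiceReference,
    PipelineResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "my-rg", "my-data-factory"

# Activity 1: ingest data from an Azure blob (datasets assumed to exist already).
ingest = CopyActivity(
    name="IngestLogs",
    inputs=[DatasetReference(reference_name="RawGameLogs", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="StagedGameLogs", type="DatasetReference")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Activity 2: run a Hive script on an HDInsight cluster, but only after the copy succeeds.
partition = HDInsightHiveActivity(
    name="PartitionLogs",
    linked_service_name=LinkedServiceReference(
        reference_name="HDInsightCluster", type="LinkedServiceReference"
    ),
    script_path="scripts/partition_logs.hql",
    script_linked_service=LinkedServiceReference(
        reference_name="GameLogStorage", type="LinkedServiceReference"
    ),
    depends_on=[ActivityDependency(activity="IngestLogs", dependency_conditions=["Succeeded"])],
)

# Both activities are managed together as one pipeline; the dependency makes them run sequentially.
adf_client.pipelines.create_or_update(
    rg, factory, "IngestAndPartition", PipelineResource(activities=[ingest, partition])
)
```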