Capture Data Changes in Azure Data Factory and Azure Synapse Analytics

Capture Data Changes in Azure Data Factory and Azure Synapse Analytics

In the cloud environment, efficient data integration and ETL processes can greatly improve the performance of your jobs. This is achieved by only reading the source data that has changed since the last time the pipeline was run, rather than always querying the entire dataset. Azure Data Factory (ADF) offers multiple methods for easily obtaining delta data. This article explains change data capture (CDC) in Azure Data Factory.

No alt text provided for this image
Change Data Capture resource in ADF

Change Data Capture (CDC) Resource in ADF

Getting started with CDC in ADF is made simple through the factory level Change Data Capture resource. This resource offers a configuration walk-through experience where you can point to your sources and destinations, apply optional transformations, and start your data capture. The CDC resource eliminates the need for pipeline and data flow activity design and is billed only 4 cores of General-Purpose data flows while data is being processed. You set a latency that ADF uses to wake-up and search for changed data, the only time you will be billed. The CDC resource also allows for continuous processes, as pipelines in ADF are batch only.

Native CDC in Mapping Data Flow

ADF mapping data flow automatically detects and extracts changed data, including inserted, updated, and deleted rows, from source databases using native CDC technology. This eliminates the need for timestamp or ID columns to identify changes. By linking a source transform and sink transform to a database dataset in a mapping data flow, the changes made to the source database will be automatically applied to the target database, allowing for easy synchronization between two tables. Transformations can also be added for processing delta data. When defining your sink data destination, insert, update, upsert, and delete operations can be set without the need for an Alter Row transformation.

Supported Connectors for CDC:

  • SAP CDC
  • Azure SQL Database
  • SQL Server
  • Azure SQL Managed Instance
  • Azure Cosmos DB (SQL API)

Auto Incremental Extraction in Mapping Data Flow

Newly updated rows or files can also be automatically detected and extracted by ADF mapping data flow from source stores. For delta data from databases, an incremental column is required to identify changes. For loading new or updated files from storage stores, ADF mapping data flow utilizes the files last modify time.

Supported Connectors for Auto Incremental Extraction:

  • Azure Blob Storage
  • ADLS Gen2
  • ADLS Gen1
  • Azure SQL Database
  • SQL Server
  • Azure SQL Managed Instance
  • Azure Database for MySQL
  • Azure Database for PostgreSQL

Customer Managed Delta Data Extraction in Pipelines

For all ADF supported data stores, you can also build your own delta data extraction pipeline. This includes using lookup activity to obtain the watermark value stored in an external control table, using a copy activity or mapping data flow activity to query delta data against a timestamp or ID column, and using a SP activity to write the new watermark value back to the external control table for the next run. For loading new files only from storage stores, files can be deleted after successful movement to the destination, or the time partitioned folder, file names, or last modified time can be used to identify new files.

Best Practices for CDC:

  • Native CDC is the simplest and recommended way to get change data, with less burden on your source database.
  • If your database store is not part of the ADF connector list with native CDC support, consider the auto incremental extraction option where you only need to input the incremental column to capture changes.
  • Customer managed delta data extraction in pipelines offers full control but covers all ADF supported databases.
  • For loading data from Azure Blob Storage, Azure Data Lake Storage Gen2,

Conclusion

In conclusion, the change data capture feature in Azure Data Factory and Azure Synapse Analytics provides an efficient and effective way for data integration and ETL processes. With its various options for delta data extraction, such as the Change Data Capture factory resource, native change data capture in mapping data flow, auto incremental extraction in mapping data flow, and customer-managed delta data extraction in pipeline, you can choose the best approach to meet your specific needs and requirements. By following the best practices outlined in this article, you can ensure that your data processing runs smoothly and efficiently. Whether you need to extract changes from databases or file-based storage systems, the change data capture feature in ADF and Azure Synapse Analytics provides the necessary tools and options to get the job done.

If you found this article?#informative?and?#helpful, please consider following me on LinkedIn?Akshay. I regularly post about data engineering and Azure.

要查看或添加评论,请登录

社区洞察

其他会员也浏览了