Capture Data Changes in Azure Data Factory and Azure Synapse Analytics
In the cloud environment, efficient data integration and ETL processes can greatly improve the performance of your jobs. This is achieved by only reading the source data that has changed since the last time the pipeline was run, rather than always querying the entire dataset. Azure Data Factory (ADF) offers multiple methods for easily obtaining delta data. This article explains change data capture (CDC) in Azure Data Factory.
Change Data Capture (CDC) Resource in ADF
Getting started with CDC in ADF is made simple through the factory level Change Data Capture resource. This resource offers a configuration walk-through experience where you can point to your sources and destinations, apply optional transformations, and start your data capture. The CDC resource eliminates the need for pipeline and data flow activity design and is billed only 4 cores of General-Purpose data flows while data is being processed. You set a latency that ADF uses to wake-up and search for changed data, the only time you will be billed. The CDC resource also allows for continuous processes, as pipelines in ADF are batch only.
Native CDC in Mapping Data Flow
ADF mapping data flow automatically detects and extracts changed data, including inserted, updated, and deleted rows, from source databases using native CDC technology. This eliminates the need for timestamp or ID columns to identify changes. By linking a source transform and sink transform to a database dataset in a mapping data flow, the changes made to the source database will be automatically applied to the target database, allowing for easy synchronization between two tables. Transformations can also be added for processing delta data. When defining your sink data destination, insert, update, upsert, and delete operations can be set without the need for an Alter Row transformation.
Supported Connectors for CDC:
Auto Incremental Extraction in Mapping Data Flow
Newly updated rows or files can also be automatically detected and extracted by ADF mapping data flow from source stores. For delta data from databases, an incremental column is required to identify changes. For loading new or updated files from storage stores, ADF mapping data flow utilizes the files last modify time.
领英推荐
Supported Connectors for Auto Incremental Extraction:
Customer Managed Delta Data Extraction in Pipelines
For all ADF supported data stores, you can also build your own delta data extraction pipeline. This includes using lookup activity to obtain the watermark value stored in an external control table, using a copy activity or mapping data flow activity to query delta data against a timestamp or ID column, and using a SP activity to write the new watermark value back to the external control table for the next run. For loading new files only from storage stores, files can be deleted after successful movement to the destination, or the time partitioned folder, file names, or last modified time can be used to identify new files.
Best Practices for CDC:
Conclusion
In conclusion, the change data capture feature in Azure Data Factory and Azure Synapse Analytics provides an efficient and effective way for data integration and ETL processes. With its various options for delta data extraction, such as the Change Data Capture factory resource, native change data capture in mapping data flow, auto incremental extraction in mapping data flow, and customer-managed delta data extraction in pipeline, you can choose the best approach to meet your specific needs and requirements. By following the best practices outlined in this article, you can ensure that your data processing runs smoothly and efficiently. Whether you need to extract changes from databases or file-based storage systems, the change data capture feature in ADF and Azure Synapse Analytics provides the necessary tools and options to get the job done.
If you found this article?#informative?and?#helpful, please consider following me on LinkedIn?Akshay. I regularly post about data engineering and Azure.