Data Replication Tools in Azure SQL Databases
Ananya Ghosh Chowdhury
Data and AI Architect at Microsoft | Public Speaker | Startup Advisor | Career Mentor | Harvard Business Review Advisory Council Member | Marquis Who's Who Listee | Founder @AIBoardroom
Data Replication includes data tracking, loading, streaming, and synchronizing functionalities.
Change Data Capture (CDC), SQL Data Sync, and change tracking are 3 distinct data replication tools used for tracking, loading, replicating, and streaming data changes in Azure SQL databases. Below is an explanation on the applicability of each one of these:
1.????Change Data Capture tracks and loads data changes on Azure SQL Databases:
CDC records insert, update, and delete activities that apply to a table. On SQL Server and Azure SQL Managed Instance, the SQL Server Agent runs the CDC scan and cleanup. Whereas, on Azure SQL Databases a scheduler runs capture and cleanup automatically within the database, without any external dependency for reliability or performance. Users still have an option to run capture and cleanup manually on demand.
Below steps highlight how CDC works on Azure SQL Databases:
a)??????To track Data Modification Language (DML) changes (insert, update and delete) enable CDC on the source database and tables.
b)?????Once CDC is enabled on source tables, associated change tables get created for each source table enabled.
c)??????The DML changes are reflected in the database transaction log. The CDC scan process picks up the committed changes from the log and adds them to the change table associated to the CDC-enabled source table. The commit Log Sequence Number (LSN) identifies changes that were committed in the same transaction and orders those transactions.
d)?????The cleanup process cleans the change tables on a retention-based policy set by the user. The default retention period is 3 days.
e)?????Query functions are provided to allow systematic access to the change data stored in the change tables.
The following illustration shows the principal data flow for CDC.
2.????SQL Data Sync to synchronize data on Azure SQL Databases:
SQL Data Sync is a service built on Azure SQL Database that synchronizes data bi-directionally across multiple databases, both on-premises (SQL Server) and in the cloud (Azure SQL Databases). It tracks changes using insert, update, and delete triggers and can be accessed through the Azure portal, PowerShell, or REST APIs. The changes are recorded on a side table in the user's database.
Data Sync is based around the concept of a?sync group. A sync group is a group of databases that one can synchronize using a hub and spoke topology. Within a sync group:
a)??????The hub database must be an Azure SQL Database.
b)?????The member databases can be databases in Azure SQL Database or in instances of SQL Server.
c)??????The Sync Metadata Database contains the metadata and log for Data Sync; it must be an Azure SQL Database located in the same region as the hub database.
SQL Data Sync is useful in scenarios such as Hybrid Data Synchronization, Distributed Applications or Globally Distributed applications where data must be updated across several databases in Azure SQL Database or SQL Server
领英推荐
3.????Change Tracking to track data changes on Azure SQL Databases:
Change tracking can be used as a foundation for both one-way and two-way synchronization applications.
a)??????One-way synchronization applications such as a client or mid-tier caching application, can be built to use change tracking. Here, a caching application requires data to be stored in the Database Engine and to be cached in other data stores. The application must be able to keep the cache up-to-date with any changes that have been made to the database tables and no changes are passed back to the Database Engine.
b)?????Two-way synchronization application built to use change tracking synchronizes data in an instance of the Database Engine with one or more data stores. The data in those stores can be updated and the changes must be synchronized back to the Database Engine.?
After change tracking is configured for a table, any DML statement affecting rows in the table will record change tracking information. To query the rows that have changed and to obtain information about the changes, use change-tracking functions. The value of the primary key column is the only information from the tracked table that is recorded with the change information. To obtain the latest data for changed rows, an application can use the primary key column values to join the source table with the tracked table.
How to choose the right Data Replication tool?
With a multitude of replication tools available and various use cases and business needs, it's often difficult to decide what tools to use in what specific scenarios. Below table highlights some of the common scenarios and the associated tools that could be used:
Some other considerations to have while selecting a data-replication tool are:
a)????Performance Impact
Assess the database tier size limitations before enabling replication tools since enabling change data capture (CDC) at the level of database and tables will lead to the creation of several additional system tables on the source database. Enabling replication tools also impacts log throughput. CDC's scan/cleanup are part of user workload. Performance impact can be substantial, because entire rows are added to the change tables and for updates operations pre-image is also included. This translates to a log generation that doubles (or more) when CDC is used (relative to when CDC is not enabled).
b)????Type of operation (Tracking / Loading / Streaming)
·??????CDC tracks source tables for changes and records these changes in change tables on the same source database. However, to replicate those changes to other targets other streaming tools must be integrated.
·??????SQL Data Sync automatically syncs source and target databases, either uni-directionally (SQL Server -> Azure SLQ DB) or bidirectionally (SQL Server -> Azure SQL DB -> SQL Server).
·??????Change tracking only tracks whether and which rows have been changed and the change type (insert, update, or delete).
?c)????Latency
Critical operations (like financial transactions) might have a lower appetite for latency. However, data changes for analytical workloads might have some room for latency.
Fraud Analytics | Trust and Safety | Security | Financial Crimes and Compliance | Tech Product Strategy
1 年Great article!