Types of Change Data Capture (CDC) in ETL

Types of Change Data Capture (CDC) in ETL

Change Data Capture (CDC) is used in ETL pipelines to detect and process data changes efficiently instead of reloading entire tables. There are several types of CDC techniques, depending on the data source, performance needs, and business requirements.


1?? Timestamp-Based CDC

How it Works:

  • Relies on a last modified timestamp column (e.g., updated_at) in the source table.
  • Extracts only new or updated records by comparing timestamps.

Example Query:

SELECT * FROM source_table 
WHERE updated_at > (SELECT MAX(updated_at) FROM target_table);        

? Pros: ? Simple to implement (if timestamps exist). ? Low overhead, works well for most use cases.

? Cons: ? If a row is deleted, it won’t be detected (only works for inserts/updates). ? Timestamp drift issues if clocks are not synchronized.


2?? Incremental CDC Using Primary Key (High Watermark)

How it Works:

  • Uses a monotonically increasing column (like an ID or sequence number).
  • Extracts rows where id > last_max_id recorded in the target.

Example Query:

SELECT * FROM source_table 
WHERE id > (SELECT MAX(id) FROM target_table);        

? Pros: ? Simple and efficient for append-only tables. ? Less overhead than log-based CDC.

? Cons: ? Won’t detect updates or deletes, only new inserts. ? Needs a strict, always-increasing primary key.


3?? Log-Based CDC (Transactional CDC)

How it Works:

  • Reads database transaction logs (e.g., MySQL binlog, PostgreSQL WAL, Oracle Redo Logs).
  • Captures inserts, updates, and deletes directly from logs.

? Pros: ? Most accurate method (captures deletes & updates). ? Low impact on the source database (doesn’t require extra queries). ? Works even if timestamps aren’t reliable.

? Cons: ? Requires database-level access to transaction logs. ? Setup can be complex (e.g., Debezium, Kafka Connect). ? Some databases may not retain logs for long.


4?? Trigger-Based CDC

How it Works:

  • Uses triggers on the source table to track changes in a separate audit/log table.
  • The ETL process reads from this audit table.

? Pros: ? Can capture inserts, updates, and deletes. ? Works on databases that don’t support log-based CDC.

? Cons: ? Performance overhead (triggers slow down inserts/updates). ? Can cause contention in high-throughput systems.


5?? Snapshot-Based CDC (Full Extract & Compare)

How it Works:

  • Takes a full snapshot of the table at regular intervals.
  • Compares with the previous snapshot to find differences (inserts, updates, deletes).

? Pros: ? Works universally (no need for timestamps, logs, or triggers). ? Simple to implement with tools like Delta Lake, Iceberg, or Hudi.

? Cons: ? Expensive & slow for large tables. ? Requires extra compute & storage.


Which CDC Method Should You Use?


Real-World Use Cases

? Data Warehouses (BigQuery, Snowflake, Redshift) → Use Timestamp-Based or Log-Based CDC

? Streaming ETL (Kafka, Spark, Flink) → Use Log-Based CDC

? Slow-changing data (SCD Type 2) → Use Snapshot-Based or Log-Based CDC

? Legacy Databases (No CDC support) → Use Trigger-Based CDC


Which One Do You Want to Implement? ??



#ChangeDataCapture #CDC #ETL #DataEngineering #DataIntegration

#BigData #DataPipeline #StreamingData #RealTimeData #DatabaseManagement

#DataTransformation #DataWarehousing #DataAnalytics #CloudData #DataStreaming

image credts : theseattledataguy.com


Koenraad Block

Founder @ Bridge2IT +32 471 26 11 22 | Business Analyst @ Carrefour Finance

3 周

Change Data Capture tracks data modifications in real time ?? Key types include Log-Based ??, which reads database logs for efficient tracking, Trigger-Based ??, using database triggers to capture changes, and Query-Based ??, periodically scanning tables for updates ?? Choosing the right CDC method ensures accurate and timely data synchronization ??

要查看或添加评论,请登录

Karan Nayyar的更多文章

社区洞察

其他会员也浏览了