Types of Change Data Capture (CDC) in ETL
Karan Nayyar
Data Engineer at Kmart Group AU | Big Data | Snowflake | AWS | Py-Spark | SQL | Python | Kafka , IBM Certified Big Data Engineer
Change Data Capture (CDC) is used in ETL pipelines to detect and process data changes efficiently instead of reloading entire tables. There are several types of CDC techniques, depending on the data source, performance needs, and business requirements.
1?? Timestamp-Based CDC
How it Works:
Example Query:
SELECT * FROM source_table
WHERE updated_at > (SELECT MAX(updated_at) FROM target_table);
? Pros: ? Simple to implement (if timestamps exist). ? Low overhead, works well for most use cases.
? Cons: ? If a row is deleted, it won’t be detected (only works for inserts/updates). ? Timestamp drift issues if clocks are not synchronized.
2?? Incremental CDC Using Primary Key (High Watermark)
How it Works:
Example Query:
SELECT * FROM source_table
WHERE id > (SELECT MAX(id) FROM target_table);
? Pros: ? Simple and efficient for append-only tables. ? Less overhead than log-based CDC.
? Cons: ? Won’t detect updates or deletes, only new inserts. ? Needs a strict, always-increasing primary key.
3?? Log-Based CDC (Transactional CDC)
How it Works:
? Pros: ? Most accurate method (captures deletes & updates). ? Low impact on the source database (doesn’t require extra queries). ? Works even if timestamps aren’t reliable.
? Cons: ? Requires database-level access to transaction logs. ? Setup can be complex (e.g., Debezium, Kafka Connect). ? Some databases may not retain logs for long.
4?? Trigger-Based CDC
How it Works:
领英推荐
? Pros: ? Can capture inserts, updates, and deletes. ? Works on databases that don’t support log-based CDC.
? Cons: ? Performance overhead (triggers slow down inserts/updates). ? Can cause contention in high-throughput systems.
5?? Snapshot-Based CDC (Full Extract & Compare)
How it Works:
? Pros: ? Works universally (no need for timestamps, logs, or triggers). ? Simple to implement with tools like Delta Lake, Iceberg, or Hudi.
? Cons: ? Expensive & slow for large tables. ? Requires extra compute & storage.
Which CDC Method Should You Use?
Real-World Use Cases
? Data Warehouses (BigQuery, Snowflake, Redshift) → Use Timestamp-Based or Log-Based CDC
? Streaming ETL (Kafka, Spark, Flink) → Use Log-Based CDC
? Slow-changing data (SCD Type 2) → Use Snapshot-Based or Log-Based CDC
? Legacy Databases (No CDC support) → Use Trigger-Based CDC
Which One Do You Want to Implement? ??
#ChangeDataCapture #CDC #ETL #DataEngineering #DataIntegration
#BigData #DataPipeline #StreamingData #RealTimeData #DatabaseManagement
#DataTransformation #DataWarehousing #DataAnalytics #CloudData #DataStreaming
image credts : theseattledataguy.com
Founder @ Bridge2IT +32 471 26 11 22 | Business Analyst @ Carrefour Finance
3 周Change Data Capture tracks data modifications in real time ?? Key types include Log-Based ??, which reads database logs for efficient tracking, Trigger-Based ??, using database triggers to capture changes, and Query-Based ??, periodically scanning tables for updates ?? Choosing the right CDC method ensures accurate and timely data synchronization ??