Data Versioning: Different Approaches for Different Needs
Armando Rodrigues
Data Engineer | Analytics Engineer | AWS | DBT | Python | SQL | Analytics | Airflow | Redshift Analytics Engineer | AI & Automation Expert |
In data engineering, keeping track of historical data is crucial. Whether you need to back up critical tables, track changes, or ensure reproducibility, versioning strategies matter. Two common approaches are:
1. Pre-Hook Backup in dbt
With dbt pre-hooks, we can create a backup before overwriting a table:
{{ config(
    materialized='table',
    pre_hook="
        CREATE TABLE IF NOT EXISTS backups.my_table_backup_{{ run_started_at.strftime('%Y%m%d') }} AS
        SELECT * FROM analytics.my_table
    "
) }}
Advantages:
- Easy to implement with dbt.
- Keeps historical snapshots in the same database.
- Enables quick restoration.

Disadvantages:
- Increases database storage usage.
- Might not scale well if storing daily snapshots.
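One way to keep the storage cost of daily backups in check is a retention window. The following is a minimal Python sketch, not part of dbt itself: it assumes the backup tables follow the my_table_backup_YYYYMMDD naming from the pre-hook above, and the helper names (backup_table_name, expired_backups, drop_statements) and the keep_days parameter are hypothetical. It only builds names and SQL strings; running them against the warehouse is left to your orchestrator.

```python
from datetime import date, datetime, timedelta

# Assumption: same table prefix as the pre-hook example.
BACKUP_PREFIX = "my_table_backup_"


def backup_table_name(run_date: date) -> str:
    """Build the backup table name for a run date (mirrors strftime('%Y%m%d'))."""
    return f"{BACKUP_PREFIX}{run_date.strftime('%Y%m%d')}"


def expired_backups(table_names: list[str], today: date, keep_days: int) -> list[str]:
    """Return backup tables whose date suffix is older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in table_names:
        if not name.startswith(BACKUP_PREFIX):
            continue  # not one of our backups
        suffix = name[len(BACKUP_PREFIX):]
        try:
            snapshot_date = datetime.strptime(suffix, "%Y%m%d").date()
        except ValueError:
            continue  # suffix is not a date; leave the table alone
        if snapshot_date < cutoff:
            expired.append(name)
    return sorted(expired)


def drop_statements(table_names: list[str], schema: str = "backups") -> list[str]:
    """Render one DROP statement per expired backup table."""
    return [f"DROP TABLE IF EXISTS {schema}.{name};" for name in table_names]
```

In practice the list of table names would come from the warehouse's information schema, and the generated statements could run as a scheduled cleanup job.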
2. Storing Versions in a Data Lake (S3, Delta Lake, Iceberg, etc.)
Instead of keeping backups in the database, we can store snapshots in a data lake:
s3://my-bucket/backups/my_table/2025-01-30/
s3://my-bucket/backups/my_table/2025-01-31/
Advantages:
- Cost-effective and scalable storage.
- Supports open formats (Parquet, Delta, Iceberg).
- Works well with distributed compute engines like Athena, Redshift Spectrum, Databricks.

Disadvantages:
- Requires additional orchestration for retrieval.
- Might introduce complexity in accessing old versions.
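Part of the retrieval complexity is simply finding the right snapshot. As a rough Python sketch, assuming the date-partitioned layout shown in the example paths, the helpers below build a snapshot prefix and pick the most recent snapshot on or before a target date. The function names are hypothetical, and the list of prefixes is assumed to come from a bucket listing done elsewhere (for example by your orchestrator); no S3 calls are made here.

```python
from datetime import date
from typing import Optional

# Assumption: same bucket layout as the example paths.
BASE_PREFIX = "s3://my-bucket/backups/my_table/"


def snapshot_prefix(snapshot_date: date) -> str:
    """Build the S3 prefix for a snapshot taken on the given date."""
    return f"{BASE_PREFIX}{snapshot_date.isoformat()}/"


def parse_snapshot_date(prefix: str) -> Optional[date]:
    """Extract the date from a snapshot prefix, or None if it does not match."""
    if not prefix.startswith(BASE_PREFIX):
        return None
    date_part = prefix[len(BASE_PREFIX):].strip("/")
    try:
        return date.fromisoformat(date_part)
    except ValueError:
        return None


def latest_snapshot_on_or_before(prefixes: list[str], target: date) -> Optional[str]:
    """Pick the newest snapshot prefix dated on or before the target date."""
    candidates = []
    for prefix in prefixes:
        snapshot_date = parse_snapshot_date(prefix)
        if snapshot_date is not None and snapshot_date <= target:
            candidates.append((snapshot_date, prefix))
    if not candidates:
        return None
    return max(candidates)[1]
```

The returned prefix can then be handed to whichever engine reads the snapshot, such as an external table location or a Spark read path.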
Choosing the Right Approach
No single method is perfect! Some teams keep short-term versions in the database for fast access and long-term archives in the data lake for cost efficiency. The key takeaway? Modern data tools allow multiple ways to store history, and choosing the right one depends on your architecture and needs.
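The hybrid setup can be expressed as a simple age-based rule. This is only an illustrative Python sketch: the storage_tier function, the tier labels, and the 7-day default for hot_days are assumptions, not a recommendation from any particular tool.

```python
from datetime import date


def storage_tier(snapshot_date: date, today: date, hot_days: int = 7) -> str:
    """Decide where a snapshot should live based on its age in days."""
    age_days = (today - snapshot_date).days
    if age_days < 0:
        raise ValueError("snapshot_date is in the future")
    # Recent snapshots stay in the database for fast restores;
    # older ones move to the data lake for cheaper storage.
    return "database" if age_days <= hot_days else "data_lake"
```

A scheduled job could apply this rule to each snapshot, exporting anything classified as data_lake before dropping it from the database.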
What’s your preferred versioning strategy? Let’s discuss!
#DataEngineering #dbt #DataVersioning #DataLakes #ModernDataStack