Data Version Control: Elevate Your Data Science Workflow with DVC
Understanding Data Version Control (DVC): Managing Data Like Code
Just as version control systems like Git revolutionized software development, Data Version Control (DVC) is emerging as a crucial tool for managing and versioning data effectively. DVC is an open-source version control system designed specifically for managing machine learning models and large datasets. It integrates with Git, so the code and the data in a machine learning project can be versioned together.
DVC enables data scientists and machine learning engineers to track changes to datasets, share and collaborate on data-centric projects, and reproduce experiments reliably.
Today, we will explore the data versioning and data pipeline features of DVC.
Why do we need DVC when we already have Git?
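Two important points (among others): Git is designed for source code and does not handle large data files well, and DVC keeps the data itself outside Git while Git tracks only a small pointer file.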
How does DVC maintain a pointer to the data?
When you add a data file to DVC using the dvc add command, DVC calculates a checksum (MD5 hash) of the file's contents. This checksum identifies that exact version of the file and lets DVC detect any later change to it.
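The idea of a content checksum can be sketched in a few lines of Python. This is only an illustration using the standard hashlib module, not DVC's own code; the helper name file_md5 and the sample file contents are made up for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large datasets do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical sample file standing in for a real dataset.
        sample = Path(tmp) / "data.xml"
        sample.write_bytes(b"<rows><row id='1'/></rows>")
        before = file_md5(sample)

        # Any change to the contents produces a different checksum.
        sample.write_bytes(b"<rows><row id='1'/><row id='2'/></rows>")
        after = file_md5(sample)

        print("before:", before)
        print("after: ", after)
        print("changed:", before != after)
```

Because the checksum depends only on the bytes in the file, renaming or copying the file leaves it unchanged, while editing a single byte gives a new value.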
Code Tutorial
pip install dvc
Refer to the official DVC documentation for a more detailed installation guide.
git init
dvc init
The dvc init command creates a new DVC project with a .dvc directory containing the necessary configuration files.
git status
You’ll see this result:
new file:   .dvc/.gitignore
new file:   .dvc/config
new file:   .dvcignore
git add .
git commit -m "adding dvc config files"
Either pick any dataset already in your Git project or download one of the sample datasets from DVC:
dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
This command downloads data.xml into the ./data/ directory of the project.
dvc add data/data.xml
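The dvc add command creates a data/data.xml.dvc file and adds data/data.xml to data/.gitignore. The contents of the .dvc file will look something like this: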
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 14445097
  hash: md5
  path: data.xml
So, DVC tracks changes to the dataset, and Git tracks the .dvc file.
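To make the pointer idea concrete, here is a rough Python sketch that reads the md5 and size fields from the text of a .dvc file and checks whether a local file still matches them. It is not how DVC does it internally: the helpers read_pointer and matches_pointer are hypothetical names, the fields are pulled out with a regular expression rather than a YAML parser, and it assumes a single output whose md5 is the digest of the file's raw bytes.

```python
import hashlib
import re
import tempfile
from pathlib import Path


def read_pointer(dvc_text):
    """Extract md5, size and path from the text of a single-output .dvc file."""
    fields = {}
    for key in ("md5", "size", "path"):
        # Match lines such as "- md5: <value>" or "  size: <value>".
        match = re.search(rf"^[ \t-]*{key}:[ \t]*(\S+)[ \t]*$", dvc_text, re.MULTILINE)
        if match:
            fields[key] = match.group(1)
    if "size" in fields:
        fields["size"] = int(fields["size"])
    return fields


def matches_pointer(data_file, pointer):
    """Return True if the file's size and MD5 equal the recorded values.
    Reads the whole file into memory, which is fine for a small demo."""
    data = Path(data_file).read_bytes()
    return (len(data) == pointer["size"]
            and hashlib.md5(data).hexdigest() == pointer["md5"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for data/data.xml and its .dvc file.
        data_file = Path(tmp) / "data.xml"
        data_file.write_bytes(b"<rows><row id='1'/></rows>")
        content = data_file.read_bytes()
        dvc_text = (
            "outs:\n"
            f"- md5: {hashlib.md5(content).hexdigest()}\n"
            f"  size: {len(content)}\n"
            "  hash: md5\n"
            "  path: data.xml\n"
        )
        pointer = read_pointer(dvc_text)
        print("unchanged file matches:", matches_pointer(data_file, pointer))

        data_file.write_bytes(content + b"<!-- edited -->")
        print("edited file matches:   ", matches_pointer(data_file, pointer))
```

The small text file is all Git needs to store; the large data file can live elsewhere and still be checked against the recorded values.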
git add data/data.xml.dvc data/.gitignore
git commit -m "Add raw data"
pip install dvc-s3
dvc remote add -d myremote s3://<bucket>/<key>
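With a version-aware remote, the .dvc file also records the etag and version_id of the pushed object after dvc push, for example:
outs:
- md5: 9f53dad1261633b7a51023bc6533ff62
  size: 14445096
  hash: md5
  path: data.xml
  cloud:
    myremote:
      etag: 9f53dad1261633b7a51023bc6533ff62
      version_id: 3jdwphhgB_WJfR7M.jR107IzkQ7iWILm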
dvc remote add -d -f myremote s3://dvc-test-aishit/data.xml
dvc remote modify myremote version_aware true
dvc push
dvc commit data/data.xml
dvc push
Now, if we want to revert to our old dataset, we just have to discard the recent changes to the .dvc file and pull the corresponding dataset from DVC.
git stash
dvc pull
Note that every time you change your dataset and run dvc commit, the md5 hash in the .dvc file changes.
Find my repo here: https://github.com/aishitdharwal/dvc-learning