Data Version Control: Elevate Your Data Science Workflow with DVC

Understanding Data Version Control (DVC): Managing Data Like Code

Just as version control systems like Git revolutionized software development, Data Version Control (DVC) is emerging as a crucial tool for managing and versioning data. DVC is an open-source version control system designed specifically for machine learning models and large datasets. It integrates with Git, so code and data can be versioned together in machine learning projects.

DVC enables data scientists and machine learning engineers to track changes to datasets, share and collaborate on data-centric projects, and reproduce experiments reliably.

Today, we will focus on DVC's data versioning feature.

Why do we need DVC when we already have Git?

Two important points (among others):

  1. Efficient Handling of Large Files: Git keeps every version of every file inside the repository, so large files quickly bloat it. DVC, on the other hand, stores only lightweight metadata in Git and manages the actual data files externally.
  2. Versioning Data and Models: DVC versions large datasets and model files outside of Git, so collaboration and reproducibility do not come at the cost of repository performance.

How does DVC maintain a pointer to the data?

When you add a data file with the dvc add command, DVC calculates a checksum (MD5 hash) of the file's contents. DVC records this checksum in a small .dvc file, which acts as the pointer: it identifies the exact version of the data and lets DVC verify the file's integrity throughout its lifecycle.
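To make the checksum idea concrete, here is a minimal sketch that computes an MD5 hash the way a content-addressed store would use it. It assumes GNU `md5sum` is installed (on macOS, `md5 -q` is the rough equivalent). The `cache_path` helper is hypothetical and only illustrates the layout a DVC 3.x cache is expected to use; treat the exact directory names as an assumption.

```shell
# Sketch: compute the kind of MD5 checksum that DVC records for a file.
# Assumes GNU coreutils `md5sum` is available.
file_md5() {
  md5sum "$1" | cut -d ' ' -f 1
}

# Hypothetical helper: where a DVC 3.x cache would be expected to keep a file
# with this hash. The first two hex characters become a directory name and
# the remaining characters become the file name.
cache_path() {
  prefix="$(printf '%s' "$1" | cut -c1-2)"
  rest="$(printf '%s' "$1" | cut -c3-)"
  printf '.dvc/cache/files/md5/%s/%s\n' "$prefix" "$rest"
}

# Demo on a throwaway file; any file path works here.
demo_file="$(mktemp)"
printf 'hello' > "$demo_file"
demo_hash="$(file_md5 "$demo_file")"
echo "md5:   $demo_hash"
echo "cache: $(cache_path "$demo_hash")"
rm -f "$demo_file"
```

Because the hash depends only on the file's contents, two identical files map to the same cache entry, and any edit to the data produces a different hash.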

Code Tutorial

  • The easiest way to install DVC is using pip

pip install dvc        

Refer to the official DVC documentation for a more detailed installation guide.

  • Since DVC is a tool that extends the capabilities of Git, we have to start inside a Git repository. Either pick an existing one, or create a new one using:

git init        

  • Now it’s time to start a new DVC project within our Git repository:

dvc init        

This command creates a new DVC project with a .dvc directory containing the necessary configuration files.

  • The configuration files need to be tracked by Git. You can see these files using:

git status        

You’ll see this result:

new file:   .dvc/.gitignore
new file:   .dvc/config
new file:   .dvcignore

  • We need to stage and commit these files so that Git can track them:

git add . 
git commit -m "adding dvc config files"        

  • Now it’s time to add some of our datasets for DVC to track!

Either pick any of your datasets in the same git project or download one of the sample datasets from DVC:

dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml        

This command downloads data.xml into the ./data/ directory of the project.

  • Add the data file to start tracking the dataset file:

dvc add data/data.xml        

  • As soon as you add your dataset, DVC creates a .dvc file (here, data/data.xml.dvc) that records the dataset's checksum and size.

Its contents will look something like this:

outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 14445097
  hash: md5
  path: data.xml
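The pointer file can be checked by hand. The sketch below reads the md5 value from a .dvc file and compares it with the current hash of the data file, which is roughly the comparison `dvc status` makes for you. It assumes GNU `md5sum` and a single-output .dvc file shaped like the one above. The helper names and the hand-written pointer are illustrative, not part of DVC.

```shell
# Sketch: does a data file still match the md5 recorded in its .dvc pointer?
# Assumes GNU `md5sum` and a single-output .dvc file shaped like the one above.

# Print the hash from the "- md5: <hash>" line (the "hash: md5" line is skipped).
recorded_md5() {
  awk '$1 == "-" && $2 == "md5:" { print $3; exit }' "$1"
}

# Succeed (exit 0) only if the file's current md5 equals the recorded one.
matches_pointer() {
  actual="$(md5sum "$2" | cut -d ' ' -f 1)"
  [ "$actual" = "$(recorded_md5 "$1")" ]
}

# Demo with throwaway files; the pointer below is hand-written for illustration.
workdir="$(mktemp -d)"
printf 'hello' > "$workdir/data.xml"
cat > "$workdir/data.xml.dvc" <<'EOF'
outs:
- md5: 5d41402abc4b2a76b9719d911017c592
  size: 5
  hash: md5
  path: data.xml
EOF

if matches_pointer "$workdir/data.xml.dvc" "$workdir/data.xml"; then
  status_before="unchanged"
else
  status_before="modified"
fi

# Edit the data without updating the pointer.
printf ' world' >> "$workdir/data.xml"
if matches_pointer "$workdir/data.xml.dvc" "$workdir/data.xml"; then
  status_after="unchanged"
else
  status_after="modified"
fi

echo "before edit: $status_before"
echo "after edit:  $status_after"
rm -rf "$workdir"
```

In a real project you would run `dvc status` instead; the point of the sketch is only that the .dvc file and the data are linked by nothing more than this hash.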

  • We don't want Git to track the actual dataset because of its size, so we track only the .dvc file in Git. DVC also adds the dataset's path to data/.gitignore so that Git ignores it.

In short, DVC tracks changes in the dataset, and Git tracks the .dvc file.

  • So, now add the DVC-generated files to Git:

git add data/data.xml.dvc data/.gitignore 
git commit -m "Add raw data"        
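The split between what Git ignores and what it can track is easy to confirm with `git check-ignore`. The sketch below builds a throwaway repository and writes the data/.gitignore entry by hand to mirror what `dvc add` generates, so it runs without DVC installed. It assumes only that `git` is available; in your own project you would run the same `git check-ignore` calls directly.

```shell
# Sketch: which paths would Git ignore after `dvc add data/data.xml`?
# Assumes `git` is installed. The data/.gitignore entry is written by hand here
# to mirror what `dvc add` generates, so the demo runs without DVC.
repo="$(mktemp -d)"
git init -q "$repo"
mkdir -p "$repo/data"
printf '/data.xml\n' > "$repo/data/.gitignore"

# `git check-ignore -q` exits 0 when the path is ignored and 1 when it is not.
is_ignored() {
  git -C "$repo" check-ignore -q "$1"
}

if is_ignored data/data.xml; then data_state="ignored"; else data_state="not ignored"; fi
if is_ignored data/data.xml.dvc; then pointer_state="ignored"; else pointer_state="not ignored"; fi

echo "data/data.xml     -> $data_state by Git"
echo "data/data.xml.dvc -> $pointer_state by Git"
rm -rf "$repo"
```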

  • Now, say you want to keep your datasets in an S3 bucket. First, you’ll have to install the dvc-s3 package, which adds S3 support to DVC.

pip install dvc-s3        

  • Create a new S3 bucket or use an existing one. Make sure versioning is enabled on the bucket (you can turn it on when you create the bucket).

  • Now, let’s register the bucket as a DVC remote in our config file. <key> is the path within your bucket where you want to keep your datasets, and the -d flag makes this the default remote.

dvc remote add -d myremote s3://<bucket>/<key>        

  • Once you have enabled cloud versioning and pushed (the next steps), your .dvc file will also contain fields that describe the copy on your remote:

outs:
- md5: 9f53dad1261633b7a51023bc6533ff62
  size: 14445096
  hash: md5
  path: data.xml
  cloud:
    myremote:
      etag: 9f53dad1261633b7a51023bc6533ff62
      version_id: 3jdwphhgB_WJfR7M.jR107IzkQ7iWILm        

  • For example, this is the command for my setup, where dvc-test-aishit is my S3 bucket. The -f flag overwrites the existing myremote entry.

dvc remote add -d -f myremote s3://dvc-test-aishit/data.xml        

  • Now, enable cloud versioning features for this remote.

dvc remote modify myremote version_aware true        
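After the remote commands above, .dvc/config should look roughly like the fragment below. This is a sketch of the expected file derived from the commands in this article, not captured output, so indentation and ordering may differ on your machine.

```ini
[core]
    remote = myremote
['remote "myremote"']
    url = s3://dvc-test-aishit/data.xml
    version_aware = true
```

Since .dvc/config is tracked by Git, committing it shares the remote settings with everyone who clones the repository.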

  • Now, just push the files to your remote.

dvc push        

  • You should now be able to see your dataset in your S3 bucket.
  • Now, make some changes to your dataset. Then commit and push with DVC.

dvc commit data/data.xml 
dvc push        

  • You should now see a new version of your dataset in the S3 bucket.

  • Note that we haven’t committed our changes to Git yet. The dvc commit step updated data/data.xml.dvc, but that change is still uncommitted.

Now, if we want to revert to our old dataset, we just have to discard the recent change to the .dvc file and pull the corresponding dataset with DVC.

git stash
dvc pull        

  • You should now have the old dataset back on your local machine.

Note that every time you change your dataset and commit with DVC, the md5 hash in the .dvc file is updated.

Find my repo here: https://github.com/aishitdharwal/dvc-learning
