MLOps | Versioning Datasets with Git & DVC

GIT

GitHub hosts projects managed with Git, the application that applies version control to your code. All the files for a project are stored in a central remote location known as a repository.

Its simple interface and easy-to-learn commands make it a natural fit for versioning code files.
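
As a quick illustration, a typical Git workflow looks like this (the repository URL and file name are placeholders):

git clone https://github.com/<user>/<repo>.git   # get a local copy of the repository
git add train.py                                 # stage a changed code file
git commit -m "Update training script"           # record the new version
git push                                         # sync it back to GitHub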

But data science projects deal with data files alongside code files, and it is certainly not advisable to maintain, say, a 50 GB data file and its multiple versions on GitHub.

So, is there a way to version our data files and keep track of them? Yes, and we achieve this with DVC (Data Version Control).


DVC

DVC enables Git to handle large files and directories with the same performance that you get with small code files.

Familiar commands like git clone can still be used: cloning brings down the code files along with the small .dvc pointer files, and a follow-up dvc pull fetches the associated data files into our workspace.
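
For example, setting up a DVC-enabled project on a new machine is a two-step affair (the URL is a placeholder, and this assumes a DVC remote has already been configured in the repository):

git clone https://github.com/<user>/<repo>.git   # code plus the small .dvc pointer files
cd <repo>
dvc pull                                         # download the actual data files from remote storage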


DATA & MODEL VERSIONING

DVC lets you capture the versions of your data and models in Git commits, while storing the actual files on-premises or in cloud storage. It also provides a mechanism to switch between these different data versions.
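
A minimal sketch of this workflow, with placeholder file and remote names:

dvc add data/raw.csv                      # track the data file; creates data/raw.csv.dvc
git add data/raw.csv.dvc data/.gitignore
git commit -m "Add raw dataset v1"        # the Git commit captures this data version
dvc remote add -d storage s3://my-bucket/dvc-store   # hypothetical remote location
dvc push                                  # upload the data to remote storage

# Later, to return to this data version:
git checkout <commit-or-tag>
dvc checkout                              # restore the matching data files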

Is it possible to run a DVC pipeline with code from one branch and a dataset from another?

Yes, certainly possible. Since model building is an iterative process, there may be scenarios where a given branch has some new features and we want to check whether those features have any impact on the current model's performance in another branch. So instead of replicating the dataset and reproducing the new features, we can switch branches and leverage the existing data versions, as sketched below.
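
One way to do this, with hypothetical branch and dataset names:

# While on the branch holding our current model code:
git checkout feature-branch -- data/features.dvc   # take just the .dvc pointer from the other branch
dvc checkout data/features                         # materialize that branch's data version
dvc repro                                          # re-run the pipeline: this branch's code, the other branch's data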



Vinoth Kumar

ML Engineer | Python | NLP | GenAI | Prompt Engineering | Ex-Aerospace Engineer

1 yr

Hi Harshwardhan Jadhav, I have a question. To set the scenario:
1. First, I added the dataset folder to DVC remote storage: dvc add input/data_converted
2. Second, I created a data preparation stage: dvc stage add -n data_preparation -d data_preparation.py -d config.ini -d ./input/data_for_spacy_conversion/ -o ./input/data_converted/ -o input/data_for_training/ python data_preparation.py

I got the error below: "ERROR: output 'input\data_converted' is already specified in stage: 'input\data_converted.dvc'. Use `dvc remove input\data_converted.dvc` to stop tracking the overlapping output." Can't we pass a DVC-tracked folder as an output parameter in the dvc stage add command?
