MLOps | Versioning Datasets with Git & DVC
GIT
GitHub uses an application known as Git to apply version control to your code. All the files for a project are stored in a central remote location known as a repository.
Its simple interface and straightforward commands make it a natural fit for versioning code files.
But data science projects also deal with data files alongside code files, and it is certainly not advisable to maintain, say, a 50 GB data file and its multiple versions on GitHub.
So, is there a way or a workaround to version our data files and keep track of them? Yes, and we achieve this with DVC (Data Version Control).
DVC
DVC enables Git to handle large files and directories with the same performance you get with small code files.
Commands like git clone can still be used: the clone brings down the code and the small .dvc pointer files, and a follow-up dvc pull fetches the associated data files into our workspace.
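As a minimal sketch of that workflow (the repository URL below is a placeholder, and a DVC remote is assumed to be configured for the project):

# clone the Git repository: this brings the code and the small .dvc pointer files
git clone https://github.com/example/project.git
cd project

# fetch the actual data files referenced by the pointers from the DVC remote
dvc pull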
Data & Model Versioning:
DVC lets you capture the versions of your data and models in Git commits, while storing the actual content on-premises or in cloud storage. It also provides a mechanism to switch between these different data versions.
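A typical flow for capturing a data version looks like this (the file paths are illustrative, and a remote is assumed to have been set up with dvc remote add):

# start tracking the dataset; DVC writes a small train.csv.dvc pointer file
dvc add data/train.csv

# commit the pointer file to Git: the data itself stays out of the repository
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data v1"

# upload the actual data to the configured remote storage
dvc push

Switching back to an earlier data version is then a matter of checking out the old pointer file (git checkout <commit> data/train.csv.dvc) followed by dvc checkout.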
Is it possible to run a DVC pipeline with code from one branch and a dataset from another?
Yes, it is certainly possible. Since model building is an iterative process, there may be scenarios where a given branch has some new features in it, and we want to check whether those new features have any impact on the current model's performance in another branch. So instead of replicating the dataset and reproducing the new features, we can switch branches and leverage the existing data versions.
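One way to do this (branch and file names below are illustrative) is to check out only the dataset's pointer file from the other branch and let DVC update the workspace:

# stay on the branch that has the code you want to run
git checkout feature-branch

# bring in just the dataset pointer file from the other branch
git checkout data-branch -- data/train.csv.dvc

# sync the workspace data with the newly checked-out pointer
dvc checkout data/train.csv

The pipeline can then be reproduced with dvc repro against this mixed code/data combination.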
Reader question:
Hi Harshwardhan Jadhav, I have a question. To set the scenario:
1. First, I added the dataset folder to DVC tracking (dvc add input/data_converted).
2. Second, I created a data preparation stage (dvc stage add -n data_preparation -d data_preparation.py -d config.ini -d ./input/data_for_spacy_conversion/ -o ./input/data_converted/ -o input/data_for_training/ python data_preparation.py).
I got the error below:
"ERROR: output 'input\data_converted' is already specified in stage: 'input\data_converted.dvc'. Use `dvc remove input\data_converted.dvc` to stop tracking the overlapping output."
Can't we pass a DVC-tracked folder as an output parameter in the dvc stage command?
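For anyone hitting the same error: a folder that is already tracked on its own (via dvc add, which creates input/data_converted.dvc) cannot also be declared as a stage output, because the stage would then own that path. As the error message itself suggests, one resolution (using the paths from the question above) is to stop tracking the folder separately and let the stage produce it:

# stop tracking the folder on its own; the data stays in the workspace
dvc remove input/data_converted.dvc

# re-create the stage with the folder declared as a stage output
dvc stage add -n data_preparation -d data_preparation.py -d config.ini -d input/data_for_spacy_conversion -o input/data_converted -o input/data_for_training python data_preparation.py

Whether data_converted should be a dependency or an output ultimately depends on which stage actually generates it.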