MLOps | Versioning Datasets with Git & DVC
GIT
GitHub uses an application known as Git to apply version control to your code. All the files for a project are stored in a central remote location known as a repository.
Its simple interface and straightforward commands make it a natural fit for versioning code files.
But data science projects also deal with data files alongside code files, and it is certainly not advisable to maintain, say, a 50 GB data file and its multiple versions on GitHub.
So, is there a way or a workaround to version our data files and keep track of them? Yes, and we achieve this with DVC (Data Version Control).
DVC
DVC enables Git to handle large files and directories with the same performance you get with small code files.
Commands like git clone can still be used: the clone brings down the code and the small .dvc pointer files, and a follow-up dvc pull fetches the associated data files into our workspace.
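As a minimal sketch of that workflow (the repository URL below is a placeholder, and a DVC remote is assumed to be configured for the project):

# clone the Git repository: this brings the code and the small .dvc pointer files
git clone https://github.com/example/project.git
cd project

# fetch the actual data files referenced by the pointers from the DVC remote
dvc pull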
Data & Model Versioning:
DVC lets you capture the versions of your data and models in Git commits, while storing the actual content on-premises or in cloud storage. It also provides a mechanism to switch between these different data versions.
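A typical flow for capturing a data version looks like this (the file paths are illustrative, and a remote is assumed to have been set up with dvc remote add):

# start tracking the dataset; DVC writes a small train.csv.dvc pointer file
dvc add data/train.csv

# commit the pointer file to Git: the data itself stays out of the repository
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data v1"

# upload the actual data to the configured remote storage
dvc push

Switching back to an earlier data version is then a matter of checking out the old pointer file (git checkout <commit> data/train.csv.dvc) followed by dvc checkout.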
Is it possible to run a DVC pipeline with code from one branch and a dataset from another?
Yes, it is certainly possible. Since model building is an iterative process, there may be scenarios where a given branch has some new features in it, and we want to check whether those new features have any impact on the current model's performance in another branch. So instead of replicating the dataset and reproducing the new features, we can switch branches and leverage the existing data versions.
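One way to do this (branch and file names below are illustrative) is to check out only the dataset's pointer file from the other branch and let DVC update the workspace:

# stay on the branch that has the code you want to run
git checkout feature-branch

# bring in just the dataset pointer file from the other branch
git checkout data-branch -- data/train.csv.dvc

# sync the workspace data with the newly checked-out pointer
dvc checkout data/train.csv

The pipeline can then be reproduced with dvc repro against this mixed code/data combination.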
Reader question:
Hi Harshwardhan Jadhav, I have a question. To set the scenario:
1. First, I added the dataset folder to DVC tracking (dvc add input/data_converted).
2. Second, I created a data preparation stage (dvc stage add -n data_preparation -d data_preparation.py -d config.ini -d ./input/data_for_spacy_conversion/ -o ./input/data_converted/ -o input/data_for_training/ python data_preparation.py).
I got the error below:
"ERROR: output 'input\data_converted' is already specified in stage: 'input\data_converted.dvc'. Use `dvc remove input\data_converted.dvc` to stop tracking the overlapping output."
Can't we pass a DVC-tracked folder as an output parameter in the dvc stage command?
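For anyone hitting the same error: a folder that is already tracked on its own (via dvc add, which creates input/data_converted.dvc) cannot also be declared as a stage output, because the stage would then own that path. As the error message itself suggests, one resolution (using the paths from the question above) is to stop tracking the folder separately and let the stage produce it:

# stop tracking the folder on its own; the data stays in the workspace
dvc remove input/data_converted.dvc

# re-create the stage with the folder declared as a stage output
dvc stage add -n data_preparation -d data_preparation.py -d config.ini -d input/data_for_spacy_conversion -o input/data_converted -o input/data_for_training python data_preparation.py

Whether data_converted should be a dependency or an output ultimately depends on which stage actually generates it.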