How to use DagsHub for Data Science
Data Professor
Data Professor is a YouTube channel providing educational videos on Data Science concepts and practical tutorials.
The data science lifecycle encompasses everything from data collection and analysis to model deployment and monitoring. But what is often overlooked is the underlying infrastructure that makes the entire lifecycle run smoothly and seamlessly.
This is especially true as data projects evolve over time: more and more data is collected, annotated and modified, while models are built, optimized and rebuilt (as models drift and are retrained on new data). The challenge of machine learning reproducibility lies in keeping track of the various loose ends of a data project, across different parallel versions of the project and among the various members of a data team.
In this article, we’re going to take a look at how we can use the DagsHub platform for managing our data science projects.
What is MLOps?
In a typical project, building machine learning models is rarely a one-time endeavor: models evolve incrementally over time as more data is collected and annotated, and models are built and rebuilt. It is therefore not feasible to build and manage models entirely by hand.
Machine Learning Operations (MLOps) is a term used to describe the set of processes and tools that help manage the end-to-end lifecycle of machine learning models. This includes everything from data preparation, feature engineering and model training to model deployment and model monitoring.
MLOps helps to streamline the machine learning lifecycle by automating many of the tasks that are traditionally done manually. By automating these tasks, MLOps can help reduce the time and effort required to manage machine learning models. In addition, MLOps can help to improve the quality of machine learning models by providing a more consistent and repeatable process.
What is DagsHub?
Enter DagsHub.
DagsHub is a platform that is built for facilitating data science collaboration by allowing teams to share, review and reuse work. In a nutshell, DagsHub can be thought of as a GitHub-like platform that is made specifically for data science and machine learning. Particularly, the platform facilitates the versioning of data, models, experiments and code so that one can revert to a prior version if needed.
The platform is built using popular open-source tools and formats that data scientists already use, and as such helps to lower the barrier to entry for its adoption in managing data science projects.
Let’s now take a look at some of the ways DagsHub can be used for managing data science projects:
MLOps Workflow with DagsHub
Let’s now take a look at the underlying framework of a typical MLOps workflow for managing machine learning models and data using the DagsHub platform.
Here, we will summarize the general MLOps workflow into 3 parts: (1) data management, (2) model development and (3) experiment tracking.
Data management
Data lies at the core of the data science lifecycle, and to ensure robust predictions and insights from the model, it is essential that great care and attention is directed towards data management. The process of data management includes data collection, data preparation, data annotation and data peer review.
Data collection is the process of gathering data from various sources. As data becomes larger, it is no longer practical to host it on GitHub, so it can instead be stored in the cloud (e.g. an S3 bucket from AWS, GCS from Google or DagsHub Storage). DagsHub’s own storage offers up to 10 GB of free space on its community tier. It is also worth noting that DagsHub allows users to jump back in time between versions with a click of a button. In particular, they can view and diff the changes directly on the DagsHub platform for images, audio, CSV files, etc.
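To make this concrete, here is a minimal sketch of reading a specific version of a DVC-tracked data file straight from a DagsHub-hosted repository using the dvc.api Python package. The repository URL, file path and revision below are placeholder values for illustration, not a real project.

```python
import io

import pandas as pd
import dvc.api

# Read a DVC-tracked CSV at a given revision (a Git tag, branch or commit).
# The repo URL and file path are hypothetical placeholders.
raw = dvc.api.read(
    path="data/raw/train.csv",
    repo="https://dagshub.com/<user>/<repo>",
    rev="v1.0",
    mode="r",
)

df = pd.read_csv(io.StringIO(raw))
print(df.shape)
```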
Data annotation is the process of adding annotations or labels to data that are subsequently used in ML model building. DagsHub supports the use of Label Studio directly on its platform with DagsHub Annotations. This means that labeling can be done directly on the platform, and the resulting annotations can then be versioned via Git. Further details are covered in a dedicated blog post.
Finally, data peer review is the process of allowing experts to review data so as to ensure the accuracy and completeness of the data.
Model development
Now that we have gathered and labeled our raw data, we’re ready to process it and develop our model.
Machine learning model development (or simply model development) entails building models by applying learning algorithms to learn from past data. To kick things off, data is preprocessed prior to model development. If the dataset is large, a subset of the data may be used to perform small-scale training.
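As a quick illustration, here is a minimal sketch of preprocessing a dataset and carving out a small subset for an initial, small-scale training run. The file name, the target column and the sampling fraction are assumptions made purely for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical processed dataset with a column named "target".
df = pd.read_csv("data/processed/dataset.csv")
df = df.dropna()  # a simple example preprocessing step

# Work with 10% of the rows first for a quick small-scale experiment.
sample = df.sample(frac=0.1, random_state=42)

X = sample.drop(columns=["target"])
y = sample["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```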
It is worth noting that training models is an iterative process that may entail several rounds of trial-and-error experiments (i.e. modifying the data processing, modifying the model architecture, etc.), which will be covered in the next section.
As shown in the illustration below, we can perform peer review on the constructed models in an effort to develop robust models. Afterwards, code is versioned via Git while data and models are versioned via DVC so as to facilitate project reproducibility.
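To illustrate what gets versioned where, below is a minimal sketch of a training script: the script itself would be committed to Git, while the data it reads and the model file it writes would be tracked with DVC (for example with `dvc add models/model.joblib`). The paths, the target column and the model choice are illustrative assumptions.

```python
# train.py
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical DVC-tracked training data with a "target" column.
df = pd.read_csv("data/processed/train.csv")
X, y = df.drop(columns=["target"]), df["target"]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# The resulting model artifact is what gets versioned via DVC.
joblib.dump(model, "models/model.joblib")
```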
Experiment tracking
Keeping track of all components of a data project (e.g. data, code and model) makes it easy to reproduce previous efforts. This is especially helpful when working in a team, as it allows us to manage the project more efficiently and lets members work on different components concurrently. DVC and MLflow are among the MLOps technologies that offer such versioning.
The experiment tracking workflow is summarized in the illustration shown below. Briefly, the construction of a single machine learning model constitutes a single experiment, whereas alterations to the model (i.e. modifying/annotating data, hyperparameter tuning, modifying the model architecture) yield additional experiments that may be tracked via tools such as MLflow and automated via Jenkins. Version tracking of models and pipelines is carried out via DVC and Git.
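For a sense of what experiment tracking looks like in practice, here is a minimal MLflow sketch that logs the hyperparameters and a metric for a single run. DagsHub repositories expose an MLflow tracking endpoint; the URI below is a placeholder, and the dataset and model are just stand-ins for the example.

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder tracking URI; each DagsHub repo provides its own MLflow endpoint.
mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_params(params)           # hyperparameters for this experiment
    mlflow.log_metric("accuracy", acc)  # metric used to compare runs
```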
Finally, peer review of the experiments (e.g. debugging data drift, debugging model performance, correcting wrongly annotated data, etc.) helps to ensure readiness for deployment to production.
Piecing everything together
In a nutshell, the MLOps workflow essentially entails the orchestration of 3 intertwining components: (1) data management, (2) model development and (3) experiment tracking.
Briefly, as data is collected and prepared, it is relayed to the model development process, after which experiment components consisting of data, code and model are tracked and monitored. A great advantage of versioning and tracking code, data and models is the ability to seamlessly move between steps, work on all parts simultaneously, and keep all project components fully reproducible.
Complementary tutorial video
I’ve created a 28 minute video showing how to use DagsHub for machine learning projects in this step-by-step practical tutorial.
Conclusion
MLOps is a relatively new field that is growing in popularity as more companies adopt machine learning models into their business processes. In this article, we have explored the use of DagsHub for setting up an MLOps pipeline for our data project. Some of the immediate advantages of implementing such a system for data projects are that it facilitates reproducibility, allows several team members to work on the project simultaneously, and archives different versions of the data, models and code.
Note: This article was originally posted on Medium in the Data Professor publication: Click here to view it.