Role of DevOps in MLOps
MLOps encourages us not to think of machine learning as a one-off scientific experiment, but as a continuous process to develop, build, and maintain a machine learning capability that the real world can use. The solution needs to be collaborative, reproducible, continuous, and tested, which brings me to a practical implementation using DevOps practices.
Let's look at some of the ways that DevOps can help MLOps achieve these objectives.
Version Control is not just for code
A common notion is that version control only comes in handy for code management, but that is not true: any entity that is subject to change can be version controlled.
In the world of machine learning this becomes even more evident; you should version control not only the code but also the data and metadata used for training.
The underlying data and metadata evolve continuously in ML, so it is essential to have a benchmark to compare against each subsequent time you train your models.
Lastly, there must be proper version control for the model itself. As a data scientist, your goal is to continuously improve your machine learning models' accuracy and reliability, so the evolving algorithm needs its own versioning. This is often referred to as a model registry.
With proper version control, you ensure reproducibility at all times, which is crucial for governance and knowledge sharing.
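For example, here is a minimal sketch of model versioning with MLflow's model registry. MLflow is one option among several; the training data, parameters, and registry name below are hypothetical, and an MLflow tracking server is assumed to be configured.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical training data; in practice this would come from your
# versioned dataset
X, y = make_classification(n_samples=500, random_state=42)

with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=1000).fit(X, y)
    mlflow.log_param("max_iter", 1000)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Log the trained model as an artifact of this run
    mlflow.sklearn.log_model(model, "model")

# Register the logged model; each call with the same name creates a new
# model version, giving the evolving algorithm its own versioned history
mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="churn-classifier",  # hypothetical registry name
)
```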
Treat the pipeline as the product, not the model
CI/CD brings together the data, model, and code components to release and refresh a predictive service.
ML is fundamentally about a repeatable experimental process, which can then be matured across different datasets and models in search of the best combination.
A pipeline is the product best suited to such scenarios and the only practical way to support production ML in the long term.
The underlying data that supports the model will change rapidly, and the model will drift. This means that eventually, the model will have to be retrained and adjusted to provide accurate outcomes in a new environment.
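To make "drift" concrete, here is a minimal sketch of one common drift score, the population stability index (PSI), which compares a feature's distribution at training time against what the model sees in production. The function, the simulated data, and the 0.2 threshold convention are illustrative, not from this post.

```python
import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10) -> float:
    """Compare a feature's training-time distribution ("expected") with the
    distribution the production model currently sees ("actual")."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) when a bin is empty
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct)
                        * np.log(actual_pct / expected_pct)))


# A PSI above roughly 0.2 is a common rule of thumb that the shift is large
# enough to warrant retraining
rng = np.random.default_rng(0)
training_income = rng.normal(50_000, 10_000, size=5_000)
production_income = rng.normal(56_000, 12_000, size=5_000)  # simulated drift
print(population_stability_index(training_income, production_income))
```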
As a result, the pipeline that produces accurate and effective machine learning models should be the product that data scientists focus on creating. So, what exactly is a machine learning pipeline?
Each time it runs, the pipeline performs the following sequence of steps (a minimal end-to-end sketch in Python follows the list):
Data ingestion. Any ML pipeline starts with data ingestion — in other words, acquiring new data from external repositories or feature stores, where data is saved as reusable "features" designed for specific business cases. This step splits data into separate training and validation sets or combines different data streams into one all-inclusive dataset.
Data validation. The goal of this step is to make sure that the ingested data meets all requirements. If anomalies are spotted, the pipeline can be automatically stopped until data engineers fix the problem. This step also tells you if your data changes over time, highlighting differences between the training sets and the live data your model uses in production.
Data preparation. Here, raw data is cleansed and brought to the quality and format your model can consume. At this step, data scientists may intervene to combine raw data with domain knowledge and build new features, using the feature engineering capabilities of DataRobot, Featuretools, or other solutions.
Model training. At last, we come to the core of the entire pipeline. In the simplest scenario, the model is trained on freshly ingested and processed data or features. But you can also launch several training runs in parallel or in sequence to identify the best parameters for a production model.
Model validation. This is when we test the final model's performance on a dataset it has never seen before to confirm its readiness for deployment.
Data versioning. Data versioning is the practice of saving data artifacts much like code versions in software development. A popular way to do it is DVC, a lightweight CLI tool built on top of Git, though you can find similar functionality in more complex solutions like MLflow or Pachyderm (a short sketch of reading a versioned dataset appears after the pipeline sketch below).
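To make that sequence concrete, here is a minimal end-to-end sketch in Python using pandas and scikit-learn. The file path, column names, schema, and feature logic are hypothetical placeholders, not a prescribed implementation.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical schema for the example dataset
EXPECTED_COLUMNS = {"age", "income", "signup_channel", "churned"}


def ingest(path: str):
    """Data ingestion: load raw data and split it into training and validation sets."""
    df = pd.read_csv(path)
    return train_test_split(df, test_size=0.2, random_state=42)


def validate(df: pd.DataFrame) -> None:
    """Data validation: halt the pipeline if the data breaks basic expectations."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if (df["age"] <= 0).any():
        raise ValueError("non-positive ages found; halting for data engineers to inspect")


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Data preparation: cleanse raw data and engineer simple features."""
    df = df.dropna()
    df["income_per_year_of_age"] = df["income"] / df["age"]  # hypothetical feature
    return pd.get_dummies(df, columns=["signup_channel"])


def train(df: pd.DataFrame) -> RandomForestClassifier:
    """Model training: fit the model on freshly prepared features."""
    X, y = df.drop(columns="churned"), df["churned"]
    return RandomForestClassifier(random_state=42).fit(X, y)


def evaluate(model: RandomForestClassifier, df: pd.DataFrame) -> float:
    """Model validation: score the model on data it has never seen."""
    X, y = df.drop(columns="churned"), df["churned"]
    return accuracy_score(y, model.predict(X))


if __name__ == "__main__":
    train_df, valid_df = ingest("data/customers.csv")  # hypothetical path
    validate(train_df)
    validate(valid_df)
    train_feats = prepare(train_df)
    # Align validation features with the training columns in case one-hot
    # categories differ between the two splits
    valid_feats = prepare(valid_df).reindex(columns=train_feats.columns, fill_value=0)
    model = train(train_feats)
    print(f"validation accuracy: {evaluate(model, valid_feats):.3f}")
```

In a real pipeline each step would be an independently scheduled and monitored stage, but the shape is the same: every run repeats the full sequence, which is exactly what makes the pipeline, not any single trained model, the reusable product.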
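And the data versioning step itself, sketched with DVC's Python API. This assumes the dataset has already been tracked with `dvc add` and the snapshot tagged in Git; the path and tag are hypothetical.

```python
import dvc.api
import pandas as pd

# Read the exact snapshot of the training data that was tagged "v1.0" in Git,
# regardless of what the file looks like in the working tree today
with dvc.api.open("data/customers.csv", rev="v1.0") as f:  # hypothetical path/tag
    df_v1 = pd.read_csv(f)
```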
A well-developed machine learning pipeline also allows you to exercise control over how models are implemented and used within the business. It improves communication across departments and enables others to review the pipeline, rather than manual workflows, to determine whether changes need to be made. Similarly, it reduces production bottlenecks and lets you make the most of your data science capabilities.
I will try to dive deeper into DevOps in ML in my next post. Stay tuned - some exciting stuff coming up soon!
Note: This content reflects my individual opinion and is not tied to any enterprise or corporate work.