The State of MLOps?

The State of DevOps report has now been published for more than ten years. Gene Kim's "The DevOps Handbook" and Nicole Forsgren's "Accelerate: Building and Scaling High Performing Technology Organizations" have taught thousands of professionals, including me, the practices that make some organizations win in business through technical excellence.

If you have spent time learning about these topics, you know they can be roughly summarized in this very simple list:

  • There is significant statistical evidence that companies that excel at technology are more likely to deliver outstanding business/organizational results
  • There are some KPIs that can be used to measure technical excellence. These KPIs measure software delivery performance and availability
  • There are 24 practices common among elite performers
  • The way we understand, follow and implement these practices evolves: new practices emerge, and previously separate practices merge into a single one

Time for "The State of MLOps?"

Organizations increasingly rely on Machine Learning, and MLOps has emerged as a response to the less-than-satisfying return on investment of many Machine Learning projects. While there is certainly a growing understanding that some practices improve the success of ML initiatives, I haven't fully understood what these practices are. I will try to provide my list here, drawing a parallel with the technical practices from "Accelerate".

Practice 1. Version control everything

As in DevOps, version control should not be limited to code but should cover every artifact used in the project: data, training code, performance evaluation code and trained models.
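Code is naturally versioned in Git, but large datasets usually are not. As a minimal, hypothetical sketch of the idea (tools such as DVC automate it properly; the file names below are made up for illustration), one can store a content hash of each dataset in a small metadata file committed next to the training code, while the dataset itself lives in external storage:

```python
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(path: str) -> str:
    """Compute a content hash of a dataset file so its exact version can be recorded."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths: the small metadata file is committed to Git together with the code,
# while the (large) dataset itself stays in external storage.
fingerprint = dataset_fingerprint("data/train.parquet")
Path("data/train.parquet.meta.json").write_text(
    json.dumps({"sha256": fingerprint, "source": "s3://my-bucket/train.parquet"}, indent=2)
)
```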

Practice 2. Invest in code quality

The Data Science community includes people with extraordinarily diverse backgrounds, and many haven't received sufficient programming training. As a result, code quality is often low, no code reviews are applied and no unit testing is required. This dramatically increases the risk of failing projects and the effort required to get something working. It also hampers collaboration across Data Scientists.
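As a purely illustrative example, even a small unit test around a preprocessing step catches regressions that would otherwise surface as silently wrong predictions (the `scale_features` function and its behaviour are hypothetical):

```python
import numpy as np
import pytest

def scale_features(x: np.ndarray) -> np.ndarray:
    """Hypothetical preprocessing step: scale each column to zero mean and unit variance."""
    std = x.std(axis=0)
    if np.any(std == 0):
        raise ValueError("constant feature column cannot be scaled")
    return (x - x.mean(axis=0)) / std

def test_scale_features_is_standardized():
    x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
    scaled = scale_features(x)
    assert np.allclose(scaled.mean(axis=0), 0.0)
    assert np.allclose(scaled.std(axis=0), 1.0)

def test_scale_features_rejects_constant_columns():
    with pytest.raises(ValueError):
        scale_features(np.ones((5, 2)))
```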

Practice 3. Automate training and model publishing

Automation is a critical part of DevOps because it reduces the risk of human error and increases productivity; the same applies to machine learning projects. An automated machine learning pipeline is capable of (a minimal sketch follows the list):

  • Retrieving the data required to train the model, with some configurable logic
  • Training the model, optionally provisioning the infrastructure for the training
  • Evaluating the model's performance
  • Storing the model in a repository
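A minimal sketch of such a pipeline, assuming a tabular CSV dataset at a hypothetical path and scikit-learn as the training library, could look like this:

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Retrieve the data (path and target column are hypothetical).
data = pd.read_csv("data/train.csv")
X, y = data.drop(columns=["label"]), data["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Train the model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 3. Evaluate the model and fail the pipeline if it is below a quality bar.
accuracy = accuracy_score(y_test, model.predict(X_test))
assert accuracy >= 0.8, f"model below quality bar: {accuracy:.3f}"

# 4. Store the model in a repository (here just a versioned file; a model registry is better).
joblib.dump(model, "models/churn-model-1.0.0.joblib")
```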

Practice 4. Maintain full traceability of models

In traditional software development, a version control system is used to store the code, and artifacts are typically tagged with a unique identifier that can be traced back to a state in the version control system. Elite performers know how important it is to version control the configuration of an application as well, and they often keep it in the same repository as the code.

For example, if a Git tag is created, the tag can be used to publish binaries with the same version as the tag. When adopting machine learning, this is no longer sufficient, since the data used for training is a critical artifact too. Therefore, to fully identify a model, you need to know the exact version of the training code and the exact version of the training data.
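One lightweight way to achieve this, sketched here with made-up file names, is to record the Git commit of the training code and the content hash of the training data next to every model you publish:

```python
import hashlib
import json
import subprocess

def current_git_commit() -> str:
    """Return the commit hash of the training code checked out right now."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

def file_sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Hypothetical artifact names; the point is that the lineage metadata travels with the model.
lineage = {
    "model_file": "models/churn-model-1.0.0.joblib",
    "code_commit": current_git_commit(),
    "training_data_sha256": file_sha256("data/train.csv"),
}
with open("models/churn-model-1.0.0.lineage.json", "w") as f:
    json.dump(lineage, f, indent=2)
```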

Practice 5. Manage binaries and separate deployment from releases

Software professionals manage the full lifecycle of their binaries: binaries are published to searchable repositories with immutable versions, so downstream consumers can consume them more easily. The same principle applies to trained models: storing them in a network folder is not enough.
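Model registries exist precisely for this. As an illustrative sketch, assuming MLflow (mentioned in the tool list at the end), a configured tracking server with a registry backend, and a hypothetical model name, registering a trained model gives it an immutable, searchable version:

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumes an MLflow tracking server with a model-registry backend is configured,
# e.g. via the MLFLOW_TRACKING_URI environment variable.
model = LogisticRegression().fit(np.array([[0.0], [1.0]]), np.array([0, 1]))  # toy training

with mlflow.start_run():
    # registered_model_name creates a new immutable version (v1, v2, ...) in the registry.
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-model")
```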

In recent years, especially thanks to containers and Kubernetes, elite performers have embraced blue/green deployments, canary deployments and feature flags. The same techniques can be used in applications that employ machine learning models, to release new models to production quickly and mitigate the risks associated with the release.
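A canary release for models can be as simple as routing a small, configurable fraction of requests to the candidate model while the rest keep hitting the current one. The routing logic below is a hypothetical sketch, with the two `predict` callables standing in for whatever serving mechanism you use:

```python
import random
from typing import Callable, Sequence

def canary_predict(
    features: Sequence[float],
    current_model: Callable[[Sequence[float]], float],
    candidate_model: Callable[[Sequence[float]], float],
    canary_fraction: float = 0.05,
) -> float:
    """Route a small fraction of traffic to the candidate model (canary release)."""
    if random.random() < canary_fraction:
        return candidate_model(features)
    return current_model(features)

# Usage sketch with trivial stand-in models:
stable = lambda x: 0.0
candidate = lambda x: 1.0
predictions = [canary_predict([0.5], stable, candidate) for _ in range(1000)]
print(f"canary served ~{sum(predictions) / len(predictions):.1%} of requests")
```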

Practice 6. Proactive monitoring and observability

Proactive monitoring is a critical part of DevOps: things will inevitably go wrong, but if you have the right systems in place you will be able to intervene and reduce the impact of an incident. On top of traditional performance monitoring, such as response times and error rates, monitor feature drift, label drift and prediction drift.
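As an illustrative sketch of drift detection, a two-sample Kolmogorov-Smirnov test can compare the distribution of a feature at training time with what the model currently sees in production (the data and alerting threshold below are made up):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)    # distribution at training time
production_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)  # what the model sees today

statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:  # hypothetical alerting threshold
    print(f"feature drift detected (KS statistic={statistic:.3f}, p={p_value:.2e})")
```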

Conclusion

This post was a (poor) attempt to draw a parallel between some DevOps practices and MLOps practices, as well as to propose an approach centered on the problem and the practices rather than on the tools that help implement those practices.

I am a big fan of using the right tools: we don't want the cost of adopting certain practices to exceed their benefit. However, looking at the practices first helps us understand why those tools are so important. It also helps professionals prioritize the practices that have a higher impact in their specific, unique business context.

Here are some tools that can help, and their closest "non-ML" equivalents:

  • Model Registries, such as MLFlow Registry <=> JFrog Artifactory or Sonatype Nexus
  • Project templates, such as MLFlow Pipelines <=> Cookiecutter, Spring Boot, Yeoman
  • Feature registries <=> again best compared to a central repository of a fundamental ingredient of your project
  • Data Quality / Observability tools, such as Great Expectations, Monte Carlo, Soda <=> Prometheus/Grafana/Alertmanager, Datadog/New Relic
