MLOps: Don't do it.
How the Shoe Cobbler enabled MLOps
Imagine life before electricity. Everything was manual. Creating products from raw material was not a trivial process. For example, a shoe manufacturer needed to source and cut the leather and rubber for the soles, and then somehow stitch them together. To keep things simple, prior to 1830 there was no 'left' or 'right' shoe; instead, both shoes were straight.
A shoe factory had to automate as much as it could with the limited resources it had. The industrial age introduced machines that could handle the work faster and more efficiently than a worker could, but at the time you couldn't simply plug the machine in. There was no electricity. Instead, factory owners had to create their own power source, usually either a steam engine or a water wheel. From there, huge rotating shafts (called Line Shafts) ran along the ceiling and drove cotton belts that ran on pulleys down to the machines themselves. This worked, but was still inefficient: machines could not be moved to improve the process, because they were fixed according to the line shaft location. Oil and dirt were constantly introduced into the facility simply because the system needed the oil to run smoothly.
What was the shoe factory really making? What was their core competency?
It should have been making shoes. Instead, it was making power. It was buying, installing, and maintaining a system in which it had no expertise, and doing so only out of necessity. Once external electricity was introduced, factory owners could suddenly offload the responsibility for this overhead and outsource power generation to someone who was better at it and could do it more efficiently and at lower cost. They outsourced their power generation to a public utility.
This is exactly what the cloud does for customers. It gets companies out of the infrastructure business and all the overhead that goes with buying, running, and maintaining a data center. In fact, it makes infrastructure much more efficient and offers significant cost benefits through elasticity and by moving capital costs (which could be considered 'sunk') to an operational model. The same logic applies to Machine Learning Operations.
Get rid of your MLOps
Machine learning is changing the world. Computers can now automate routine tasks such as identifying objects in images and converting speech to text. Pipelining models together allows organizations to create very complex systems that do things like letting cars see and recognize traffic signs, crosswalks, and people. The use cases being handled by machine learning are mind-boggling and almost limitless. In the context of a commercial organization, the goal of machine learning should be to create business value by:
- Reducing costs by automating routine work
- Improving customer experience and therefore retention
- Creating new offerings of value that customers will pay for
These should be the focus of the advanced analytics strategy for every company. Using machine learning to translate data into actionable items that improve the business should be their core competency.
Instead, however, companies invest significant amounts of money and time in data scientists while also burdening them with machine learning deployment, operations, and infrastructure, work that is outside their skillset and detracts from this core mission. MLOps is the biggest distraction here. Getting models out of development and training and into production by deploying, operating, and managing them is a very large and very difficult undertaking. It's like having a shoe factory spend time and money managing a water wheel and Line Shaft. Unless there is no other option, it's a fool's errand.
Bus Factor vs Buy
We talk to companies every day, in every industry, that are tackling the MLOps problem in a variety of ways. The truth is, engineers like to build things, and there's no shortage of open source tools from which they can pick and choose to solve a particular problem. We see companies using tools such as SageMaker, Apache Airflow, Python, MLflow, Kubeflow, and others to cobble together an orchestration solution that somehow dumps the results onto a Kubernetes cluster that ultimately deploys the model.
Similar to the shoe factory, these companies are just trying to solve a problem the best way they can. But again, this is not their core competency, and it distracts from the goals outlined in the section above. In fact, what they're really doing is adding incredible risk to the organization by creating a system that is fragile, incomplete, undocumented, and requires very specialized skills to maintain. In business school, we often talk about creating processes that are standardized and repeatable, and we test that by asking the question 'if I were hit by a bus, would this process still run?' Morbid, maybe, but the reality is that if someone leaves the organization for whatever reason, what happens to the system they built? Who can step in and seamlessly operate and maintain it? What if the models are customer facing and the system hiccups, even for an hour? What is the impact on the business? In some cases it can be very, very expensive to the top line. Turnover is one of the biggest pain points we hear about in maintaining an internal solution.
What do I do?
If your company is already using cloud computing, then you already know the answer. You need to think of MLOps the same way you think about infrastructure. If you can offload it to a commercial platform that is better at it, more cost effective, more stable, and future-proof, then you need to take that option. Even if you're still on-prem due to regulations, security requirements, or simply a management belief that owning is better than renting, you need to assess the risk of creating your own MLOps practice. The complexity of infrastructure options only adds to the risk. Multi-cloud and hybrid setups are normal now, and they bring their own complications. Again, let someone else who is better equipped take on that responsibility.