"Fail fast" vs Machine learning.
Chris Pedder
Chief Data Officer @ OBRIZUM | Board advisor | Data transformation leader | Posting in a personal capacity.
Yep, you read that right. There can be only one...
Of course, not really. But there are issues. To get there, first we have to duck back into a time when Highlander was still a thing. Well, almost. The first big tech boom of the internet bubble created the conditions in which many of the modern giants were forged: a time of easy money and a fail-fast culture. Software development, which had been a careful, herbivorous process before, became stripped down and lean, fast and effective, and this new approach saw the launch of some of the biggest companies out there today. Three in particular succeeded off the back of apparently very different approaches: Amazon (fastest) got there first and strove to maintain its lead, Google (smartest) leveraged its research capabilities, and Apple (cutest) out-marketed everyone else. People saw the new era, and they wanted to embrace the approach to software that seemed to make these fast-moving companies succeed.
Agile at its core is a very sensible approach to software development. It's basically local optimization under constraints - for those of you familiar with machine learning, it's gradient descent. Take a look at where you are, take some time (usually a two-week sprint) to look around, follow the path that leads to the next local optimum you can see, rinse and repeat. Except there's a third rule in there that often gets forgotten, but in many ways is the most important of them all: if your path downhill bifurcates, take the branch that looks like it has more options in the future, rather than hemming yourself in. Just as good software engineers know this, so do good chess players - choose mobility over material, or end up in the dreaded Zugzwang position (https://en.wikipedia.org/wiki/Zugzwang).
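To make the analogy concrete, here's a minimal sketch (mine, not from any agile handbook - all names and parameters are illustrative) of sprint-by-sprint work as gradient descent: each sprint evaluates the local slope and steps toward the nearest improvement.

```python
# Each "sprint" looks at the local gradient and takes a small step downhill.
# The function f(x) = (x^2 - 1)^2 has two equally good minima, at x = -1 and
# x = +1; local search starting on the left only ever finds the left one.

def sprint_step(x, gradient, lr=0.01):
    """One sprint: look around (evaluate the gradient), step toward the local optimum."""
    return x - lr * gradient(x)

def local_search(x0, gradient, sprints=200, lr=0.01):
    x = x0
    for _ in range(sprints):
        x = sprint_step(x, gradient, lr)
    return x

# Gradient of (x^2 - 1)^2 is 4x(x^2 - 1).
grad = lambda x: 4 * x * (x**2 - 1)

# Starting at x = -2, the search settles near -1 and never "sees" the
# equally good optimum at +1, however many sprints you run.
print(round(local_search(-2.0, grad), 3))
```

The point of the toy: rinse-and-repeat local steps reliably reach *a* optimum, but which one you get is fixed by where you started.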
So why do I think this is problematic? Well, in principle I don't - the problem emerges not from the implementation, but from tradition. Machine learning systems are fundamentally different from non-ML software. For one thing, they are dynamic, almost living things: there is no real concept of doing things so they stay done. As new data comes along, you need to retrain and evolve your model just to keep doing the same thing. They also have many moving parts which cannot be individually optimized. In normal code development, you can write an object or a routine, optimize it, and then incorporate it into a larger ecosystem. Machine learning systems aren't amenable to this modular approach - in fact, that is kind of the point. They are, after all, *non-linear* function approximators, so it's probably not surprising that you can't build them out of linearly connected modules and have them work optimally. Instead, they require a global search, and this is where a system like agile, designed for local exploration, can lead you into problems. Like a lot of research problems, you can spend your whole two-week sprint exploring dead space and producing no visible progress. Adherents of fail fast would say "okay, you're in a dead end, stop", but the reality is that you are rarely wasting time - you're doing the training you need to make progress. Like a reinforcement-learning robot, you have to fall over more than a few times before you get to the finish line.
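The local-versus-global distinction can be shown with a toy of my own (not from the post): a greedy local search stalls in the nearest shallow basin, while a "research-style" budget that tolerates many failed restarts - sprints that look like dead ends - eventually lands in the deeper one.

```python
import random

def loss(x):
    # Two basins: a shallow local minimum at x = 0 (loss 1.0) and a
    # deeper global minimum at x = 5 (loss 0.0).
    return min(x ** 2 + 1.0, (x - 5) ** 2)

def greedy_descent(x, step=0.1, iters=500):
    """Sprint-style local search: only ever accept the best nearby move."""
    for _ in range(iters):
        x = min((x - step, x, x + step), key=loss)
    return x

def with_restarts(restarts=20, seed=0):
    """Global, research-style search: many restarts, most of which 'fail'."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        x = greedy_descent(rng.uniform(-10, 10))
        if best is None or loss(x) < loss(best):
            best = x
    return best
```

Starting near zero, `greedy_descent` converges to the shallow basin and stays there; `with_restarts` spends most of its budget on runs that end up no better, yet comes back with the deeper optimum. The wasted-looking restarts are exactly the training the paragraph above describes.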
So how do we do it better? Well, coming from a physics research background, ML projects are much more like research problems than they are conventional software development. Whilst there are many problems with the practical details of how physics research is funded, we can take away some salient lessons on what works well. So here are my suggestions.
- Give ML staff the freedom to look for their own applications within the business, rather than giving them problems to solve. They will naturally seek out problems that are most amenable to what they can do.
- Allow them to "apply" for time to work on projects. The initial block of time should be easy to get, and allow for them to make mistakes without those mistakes being detrimental to the project's success. The average time to build an ML model in industry is 59 working days, with even longer needed to productionize.
- Allow extensions to the initial period contingent on there being measurable results, and a high probability of success during the extension period.
- Involve non-ML team members in ML projects, and have your ML staff present and explain their work to the rest of their colleagues. Under no circumstances allow ivory towers to be built!
- Work hard on expectation management - it's a hot field, it's very hyped, and there are more than a few people out there who believe in magic. Magic is for kids; a lot of sweat and a lot of wrong turns go into making a working ML model.
Based on my experiences of working in deep learning, I suggest an update to the old adage for the machine learning age. It shouldn't be "fail fast". It should be "work fast, fail often, succeed when it matters most".