Testing Machine Learning Systems
Testing in software development is almost a science at this point. You’ve probably heard the quip, “Write tests, some unit, mostly integration”. The idea is that the parts of your software most likely to break are the interactions between systems. When another team has to call your API or use a schema from your messaging queue, these interactions traverse teams or departments, which means the full context of how the interaction should work is fragmented. Hence the emphasis on integration tests (and, separately, contract tests) which ensure that APIs and components work together as intended. With machine learning systems, not only do we need to test the interactions between different components, but also the properties of the non-deterministic model we’ve trained. I call these Macro and Micro tests: Micro tests verify canonical examples, while Macro tests provide aggregate performance snapshots of your model.
The goal of Micro tests is to place guardrails on your model. Imagine a product manager wants to understand the sentiment of product reviews. However, this person isn’t familiar with machine learning and leaves the rest of the problem definition to you. One way to instill trust in the system is to ask the product manager for a list of positive and negative examples that the model must always predict correctly given a prediction threshold. In essence, these are examples from the middle of the sentiment data distribution: clear-cut cases the model should always get right. For example:
Negative sentiment micro tests:
Positive sentiment micro tests:
In contrast, examples like “The app tends to open when I click it” or “Most apps like this require an extensive amount of work” don’t clearly reflect what we want to predict (sentiment) and wouldn’t make good tests.
Once we have these, we can create test cases that each new and existing model is required to pass as part of a continuous integration test suite (sketched below). Now, not only will you have guardrails for how your model performs, but the product manager can see examples and build a mental model of how the system will work in production. This, alongside a continuous stream of examples with predictions, goes a long way toward making an ML system understandable.
We want examples that are in the middle of the data distribution we are predicting against
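To make this concrete, here is a minimal sketch of what Micro tests might look like in a pytest-based CI suite. The example reviews, the load_model helper, and the 0.5 threshold are hypothetical placeholders; swap in your own model interface and the canonical examples agreed upon with your product manager.

```python
# test_micro_sentiment.py -- minimal sketch of Micro tests as CI guardrails.
# `load_model` and the example reviews are hypothetical placeholders; the assumed
# interface is model.predict_proba(text) -> probability that the review is positive.
import pytest

from my_project.models import load_model  # hypothetical helper returning the latest trained model

THRESHOLD = 0.5  # prediction threshold agreed upon for the guardrails

NEGATIVE_EXAMPLES = [
    "This app crashes every time I open it.",
    "Terrible experience, I want a refund.",
]

POSITIVE_EXAMPLES = [
    "Love this app, it works perfectly.",
    "Great update, everything is faster now.",
]


@pytest.fixture(scope="module")
def model():
    return load_model()


@pytest.mark.parametrize("review", NEGATIVE_EXAMPLES)
def test_negative_micro(model, review):
    # Canonical negative reviews must stay below the threshold for every model version.
    assert model.predict_proba(review) < THRESHOLD


@pytest.mark.parametrize("review", POSITIVE_EXAMPLES)
def test_positive_micro(model, review):
    # Canonical positive reviews must stay at or above the threshold for every model version.
    assert model.predict_proba(review) >= THRESHOLD
```

Running this suite on every candidate model in CI means a regression on any canonical example blocks the release, which is exactly the guardrail behavior the product manager signed up for.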
In contrast to the individual examples of Micro testing, Macro tests aim to capture aggregate performance metrics, e.g. precision and recall, similar to a train/test/validate split of your data. There are two goals of Macro tests: (1) verify that the model’s aggregate performance on a holdout split meets the requirements of the business, and (2) compare new models, or new hyperparameters, against existing models on the same holdout validation split.
The first is more straightforward. Split your data 80/10/10 and use the last 10% to assess your macro/micro F1, precision, and recall scores. The actual performance requirements will depend on the business cost of an incorrect prediction, as well as whether the prediction is an opt-in interaction (e.g., “here is a suggestion you can use”) versus an opt-out interaction (e.g., “the system has taken a proactive action on your behalf”).
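As a rough sketch of this first goal, assuming scikit-learn and a labeled dataframe with “text” and “label” columns (both placeholders), the split and the aggregate checks might look like the following; the minimum scores in REQUIREMENTS are stand-ins for whatever the business cost analysis dictates.

```python
# macro_metrics.py -- sketch of Macro test goal (1): aggregate performance on a holdout split.
# Assumes scikit-learn; the dataframe columns, binary 0/1 labels, and the minimum
# scores in REQUIREMENTS are placeholders to adapt to your own problem.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

REQUIREMENTS = {"macro_f1": 0.80, "precision": 0.85, "recall": 0.75}  # business-driven minimums


def split_80_10_10(df: pd.DataFrame, seed: int = 42):
    """Split labeled data into 80% train, 10% test, 10% validate."""
    train, rest = train_test_split(df, test_size=0.2, random_state=seed, stratify=df["label"])
    test, validate = train_test_split(rest, test_size=0.5, random_state=seed, stratify=rest["label"])
    return train, test, validate


def macro_test(model, validate: pd.DataFrame) -> dict:
    """Compute aggregate metrics on the 10% validation split and check them against requirements."""
    preds = model.predict(validate["text"])
    scores = {
        "macro_f1": f1_score(validate["label"], preds, average="macro"),
        "micro_f1": f1_score(validate["label"], preds, average="micro"),
        "precision": precision_score(validate["label"], preds),
        "recall": recall_score(validate["label"], preds),
    }
    failures = {k: v for k, v in scores.items() if k in REQUIREMENTS and v < REQUIREMENTS[k]}
    assert not failures, f"Macro test failed: {failures}"
    return scores
```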
The spirit of the second is to test new models, or new hyperparameters, on the same 10% holdout validation split. If you are comparing two models at the same point in time, this is straightforward to implement, as shown below. Here, we are training on data from February 2019 to January 2020, testing on the month of data from January 2020 to February 2020, and validating on the month of data from February 2020 to March 2020: a one-month validation split for this data set.
Here we can see that both models are validated on the same validation dataset, allowing a direct comparison of performance.
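A minimal sketch of that setup might look like the following; the “timestamp” column and the train_model factories are hypothetical placeholders, and the point is simply that every candidate is fit on the same training window and scored on the same one-month validation window.

```python
# time_split_comparison.py -- sketch of comparing two models on the same time-based splits.
# The "timestamp", "text", and "label" columns and the train_model factories are
# hypothetical placeholders; dates follow the Feb 2019 - Mar 2020 example above.
import pandas as pd
from sklearn.metrics import f1_score


def time_split(df: pd.DataFrame):
    """Train on Feb 2019 - Jan 2020, test on Jan 2020, validate on Feb 2020."""
    train = df[(df["timestamp"] >= "2019-02-01") & (df["timestamp"] < "2020-01-01")]
    test = df[(df["timestamp"] >= "2020-01-01") & (df["timestamp"] < "2020-02-01")]
    validate = df[(df["timestamp"] >= "2020-02-01") & (df["timestamp"] < "2020-03-01")]
    return train, test, validate


def compare(df: pd.DataFrame, architectures: dict) -> dict:
    """Fit each candidate on the same train split and score it on the same validation split."""
    train, _, validate = time_split(df)
    scores = {}
    for name, train_model in architectures.items():
        model = train_model(train)  # hypothetical factory: takes a dataframe, returns a fitted model
        preds = model.predict(validate["text"])
        scores[name] = f1_score(validate["label"], preds, average="macro")
    return scores
```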
However, as new data rolls in over time, the validation split should update to reflect this; the best way to test the generalizability of your system is to validate on the most recent data. Following the example above, if we are training a new model architecture in April 2020, the validation split shifts to March 2020 through April 2020. Therefore, to keep our Macro tests comparable across models, all models, previously trained and newly trained alike, need to be evaluated on the same March-April 2020 validation split. A model trained on more recent data will likely perform better, but this isn’t guaranteed. In my opinion, this is the best way to test the generalizability of your architecture as new data comes in.
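Continuing the sketch above (again with hypothetical column names and a dict of already-fitted models), a rolling validation split could look like this, with every existing and newly trained model re-scored on the same most recent month of data.

```python
# rolling_validation.py -- sketch of re-validating all models on the most recent month of data.
# Column names and the `all_models` dict of fitted models are hypothetical placeholders.
import pandas as pd
from sklearn.metrics import f1_score


def latest_validation_split(df: pd.DataFrame, months: int = 1) -> pd.DataFrame:
    """Return the most recent `months` of data as the shared validation split."""
    cutoff = df["timestamp"].max() - pd.DateOffset(months=months)
    return df[df["timestamp"] > cutoff]


def revalidate(all_models: dict, df: pd.DataFrame) -> pd.DataFrame:
    """Score every existing and newly trained model on the same, most recent validation split."""
    validate = latest_validation_split(df)
    rows = []
    for name, model in all_models.items():
        preds = model.predict(validate["text"])
        rows.append({"model": name, "macro_f1": f1_score(validate["label"], preds, average="macro")})
    return pd.DataFrame(rows).sort_values("macro_f1", ascending=False)
```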
Testing in machine learning is tricky due to the non-deterministic nature of the approaches used. This is especially true when building systems with people who aren’t as familiar with the details as you are; to them, a prediction with 63% confidence may mean nothing. Therefore, this pattern is used to (1) improve confidence that your model will do what you, and others, think it will do, and (2) create a process for updating models and parameters in an apples-to-apples fashion. I’ve found this framework useful for accomplishing both.
If you're interested, here are some additional resources on testing machine learning systems.
Thanks for reading!