Testing Machine Learning Systems

Testing in software development is almost a science at this point. You’ve probably heard the quip, “Write tests, some unit, mostly integration”. The idea is that the parts of your software most likely to break are the interactions between systems. When another team has to call your API or consume a schema from your messaging queue, the interaction crosses teams or departments, which means the full context of how it should work is fragmented. Hence the emphasis on integration tests (and, separately, contract tests) that ensure APIs and components work together as intended. With machine learning systems, we not only need to test the interactions between different components, but also the properties of the non-deterministic model we’ve trained. I call these Macro and Micro tests: Micro tests verify canonical examples, while Macro tests provide aggregate performance snapshots of your model.

The goal of Micro tests is to place guardrails on your model. Imagine a product manager wants to understand the sentiment of product reviews. However, this person isn’t familiar with machine learning and leaves the rest of the problem definition to you. One way to instill trust in the system is to ask the product manager for a list of positive and negative examples that the model must always predict correctly at a given prediction threshold. In essence, these are examples that sit at the midpoint of the sentiment data distribution. For example:

Negative sentiment micro tests:

  • “I hate this product”
  • “The UX is terrible and I can’t tell what to do”
  • “The app used to be good but now it sucks”

Positive sentiment micro tests:

  • “This app is great!”
  • “Super easy to use”
  • “My favorite app in the app store”


By contrast, examples like “The app tends to open when I click it” or “Most apps like this require an extensive amount of work” don’t reflect what we want to predict, sentiment, and wouldn’t make good tests.


Once we have these examples, we can create test cases that each new and existing model is required to pass as part of a continuous integration test suite. Now, not only will you have guardrails on how your model performs, but the product manager can see examples and build a mental model of how your system will behave in production. This, alongside a continuous stream of examples with predictions, is helpful for understanding ML systems.
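
To make this concrete, here is a minimal sketch of what such Micro tests could look like in a CI suite, written pytest-style. The my_sentiment_model module, its load_model and predict_proba interface, and the 0.5 threshold are illustrative assumptions, not something prescribed by this article.

```python
# Micro tests: canonical examples every candidate model must get right.
# NOTE: my_sentiment_model, load_model, predict_proba, and THRESHOLD are
# hypothetical stand-ins for your own model-serving code.
import pytest

from my_sentiment_model import load_model  # hypothetical helper

model = load_model("candidate")
THRESHOLD = 0.5  # assumed decision threshold agreed upon with the product manager

NEGATIVE_EXAMPLES = [
    "I hate this product",
    "The UX is terrible and I can't tell what to do",
    "The app used to be good but now it sucks",
]

POSITIVE_EXAMPLES = [
    "This app is great!",
    "Super easy to use",
    "My favorite app in the app store",
]


@pytest.mark.parametrize("text", NEGATIVE_EXAMPLES)
def test_negative_sentiment_guardrails(text):
    # Any model that scores a canonical negative review above the
    # threshold fails CI and never ships.
    assert model.predict_proba(text) < THRESHOLD


@pytest.mark.parametrize("text", POSITIVE_EXAMPLES)
def test_positive_sentiment_guardrails(text):
    # Likewise, canonical positive reviews must clear the threshold.
    assert model.predict_proba(text) >= THRESHOLD
```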

[Image: We want examples that are in the middle of the data distribution we are predicting against.]

In contrast to the individual examples of Micro testing, Macro tests aim to capture aggregate performance metrics, e.g. precision and recall, computed on a train/test/validate split of your data. There are two goals of Macro tests:

  1. Generate aggregate metrics to validate model performance against external benchmarks (e.g. we need to have >95% precision for this use case)
  2. Compare new model hyperparameters or model architectures against each other in an apples-to-apples fashion.

The first is more straightforward. Split your data 80/10/10 and use the last 10% to assess your macro/micro F1, precision, and recall scores. The actual performance requirements will depend on the business cost of an incorrect prediction, as well as whether the prediction is an opt-in interaction (e.g. here is a suggestion you can use) or an opt-out interaction (e.g. the system has taken a proactive action on your behalf).
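
As a rough illustration, a Macro test against an external benchmark could look like the sketch below. The holdout file path, column names, and load_model helper are assumptions, while the >95% precision target comes from the example above.

```python
# Macro test: score the 10% holdout split and enforce the business benchmark.
# NOTE: the file path, column names, and load_model helper are hypothetical.
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

from my_sentiment_model import load_model  # hypothetical helper


def test_macro_metrics_meet_benchmark():
    holdout = pd.read_csv("data/holdout_10pct.csv")  # assumed 10% holdout split
    model = load_model("candidate")
    preds = model.predict(holdout["text"])

    precision = precision_score(holdout["label"], preds)
    recall = recall_score(holdout["label"], preds)
    f1 = f1_score(holdout["label"], preds)
    print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

    # External benchmark from the use case: precision must exceed 95%.
    assert precision > 0.95
```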

The spirit of the second is to test new models, or hyperparameters, on the same 10% holdout validation split. If you are comparing two models at the same point in time, this is easy to implement, as shown below. Here, we train on data from February 2019 to January 2020, test on the month from January 2020 to February 2020, and validate on the month from February 2020 to March 2020, giving a one-month validation split for this data set.

[Image: Both models are validated on the same validation dataset, allowing a direct comparison of performance.]


However, as new data rolls in over time, the validation split should update to reflect this; the best way to test the generalizability of your system is to validate on the most recent data. Following the example above, if we train a new model architecture in April 2020, the validation split shifts to March 2020 through April 2020. Therefore, to ensure our Macro tests are valid across models, all models, both previously trained and newly trained, need to be scored on the same March-April 2020 validation split. It’s likely that a model trained on more recent data performs better, but this isn’t guaranteed. In my opinion, this is the best way to test the generalizability of your architecture as new data comes in.
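
Here is a minimal sketch of that rolling comparison, assuming a reviews table with created_at, text, and label columns and the same hypothetical load_model helper as above: every model, old or new, is scored on the same most-recent validation window.

```python
# Rolling Macro comparison: all models are evaluated on the same
# most-recent validation window so comparisons stay apples-to-apples.
# NOTE: the dataset schema and load_model helper are hypothetical.
import pandas as pd
from sklearn.metrics import f1_score

from my_sentiment_model import load_model  # hypothetical helper

df = pd.read_csv("data/reviews.csv", parse_dates=["created_at"])

# Following the article's example: in April 2020 the validation
# window shifts to March 2020 - April 2020.
VAL_START, VAL_END = "2020-03-01", "2020-04-01"
validate = df[(df["created_at"] >= VAL_START) & (df["created_at"] < VAL_END)]

models = {
    "existing_model": load_model("existing"),    # previously trained
    "candidate_model": load_model("candidate"),  # new architecture / hyperparameters
}

for name, model in models.items():
    preds = model.predict(validate["text"])
    score = f1_score(validate["label"], preds)
    print(f"{name}: f1={score:.3f} on {VAL_START}..{VAL_END}")
```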

Testing in machine learning is tricky due to the non-deterministic nature of the approaches used, especially when building systems with people who aren’t as familiar with the details as you are; to them, a prediction with 63% confidence may mean nothing. Therefore, this pattern is used to (1) improve confidence that your model will do what you, and others, think it will do, and (2) create a process for updating models and parameters in an apples-to-apples fashion. I’ve found this framework useful for accomplishing both.

If you're interested, here are some additional resources on testing machine learning systems.

https://www.jeremyjordan.me/testing-ml/

https://developers.google.com/machine-learning/testing-debugging


Thanks for reading!

