Testing Machine Learning Systems
Testing in software development is almost a science at this point. You’ve probably heard the quip, “Write tests, some unit, mostly integration”. The idea is that the parts of your software most likely to break are the interactions between systems. When another team has to call your API or use a schema from your messaging queue, these interactions traverse teams or departments, which means the full context of how the interaction should work is fragmented. Hence the emphasis on integration tests (and, separately, contract tests) which ensure that APIs and components work together as intended. With machine learning systems, not only do we need to test the interactions between different components, but also the properties of the non-deterministic model we’ve trained. I call these Macro and Micro tests: Micro tests verify canonical examples, while Macro tests provide aggregate performance snapshots of your model.
The goal of Micro tests is to place guardrails on your model. Imagine a product manager wants to understand the sentiment of product reviews. However, this person isn’t familiar with machine learning and leaves the rest of the problem definition to you. One way to instill trust in the system is to ask the product manager for a list of positive and negative examples that the model must always predict correctly given a prediction threshold. In essence, these are examples from the middle of the sentiment data distribution: clear-cut cases the model should always get right. For example:
Negative sentiment micro tests:
Positive sentiment micro tests:
In contrast, examples like “The app tends to open when I click it” or “Most apps like this require an extensive amount of work” don’t clearly reflect what we want to predict (sentiment) and wouldn’t make good tests.
Once we have these, we can create test cases that each new and existing model is required to pass as part of a continuous integration test suite (sketched below). Now, not only will you have guardrails for how your model performs, but the product manager can see examples and build a mental model of how the system will work in production. This, alongside a continuous stream of examples with predictions, goes a long way toward making an ML system understandable.
We want examples that are in the middle of the data distribution we are predicting against
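To make this concrete, here is a minimal sketch of what Micro tests might look like in a pytest-based CI suite. The example reviews, the load_model helper, and the 0.5 threshold are hypothetical placeholders; swap in your own model interface and the canonical examples agreed upon with your product manager.

```python
# test_micro_sentiment.py -- minimal sketch of Micro tests as CI guardrails.
# `load_model` and the example reviews are hypothetical placeholders; the assumed
# interface is model.predict_proba(text) -> probability that the review is positive.
import pytest

from my_project.models import load_model  # hypothetical helper returning the latest trained model

THRESHOLD = 0.5  # prediction threshold agreed upon for the guardrails

NEGATIVE_EXAMPLES = [
    "This app crashes every time I open it.",
    "Terrible experience, I want a refund.",
]

POSITIVE_EXAMPLES = [
    "Love this app, it works perfectly.",
    "Great update, everything is faster now.",
]


@pytest.fixture(scope="module")
def model():
    return load_model()


@pytest.mark.parametrize("review", NEGATIVE_EXAMPLES)
def test_negative_micro(model, review):
    # Canonical negative reviews must stay below the threshold for every model version.
    assert model.predict_proba(review) < THRESHOLD


@pytest.mark.parametrize("review", POSITIVE_EXAMPLES)
def test_positive_micro(model, review):
    # Canonical positive reviews must stay at or above the threshold for every model version.
    assert model.predict_proba(review) >= THRESHOLD
```

Running this suite on every candidate model in CI means a regression on any canonical example blocks the release, which is exactly the guardrail behavior the product manager signed up for.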
In contrast to the individual examples of Micro testing, Macro tests aim to capture aggregate performance metrics, e.g. precision and recall, similar to a train/test/validate split of your data. There are two goals of Macro tests: (1) verify that the model’s aggregate performance on a holdout split meets the requirements of the business, and (2) compare new models, or new hyperparameters, against existing models on the same holdout validation split.
The first is more straightforward. Split your data 80/10/10 and use the last 10% to assess your macro/micro F1, precision, and recall scores. The actual performance requirements will depend on the business cost of an incorrect prediction, as well as whether the prediction is an opt-in interaction (e.g., “here is a suggestion you can use”) versus an opt-out interaction (e.g., “the system has taken a proactive action on your behalf”).
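As a rough sketch of this first goal, assuming scikit-learn and a labeled dataframe with “text” and “label” columns (both placeholders), the split and the aggregate checks might look like the following; the minimum scores in REQUIREMENTS are stand-ins for whatever the business cost analysis dictates.

```python
# macro_metrics.py -- sketch of Macro test goal (1): aggregate performance on a holdout split.
# Assumes scikit-learn; the dataframe columns, binary 0/1 labels, and the minimum
# scores in REQUIREMENTS are placeholders to adapt to your own problem.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

REQUIREMENTS = {"macro_f1": 0.80, "precision": 0.85, "recall": 0.75}  # business-driven minimums


def split_80_10_10(df: pd.DataFrame, seed: int = 42):
    """Split labeled data into 80% train, 10% test, 10% validate."""
    train, rest = train_test_split(df, test_size=0.2, random_state=seed, stratify=df["label"])
    test, validate = train_test_split(rest, test_size=0.5, random_state=seed, stratify=rest["label"])
    return train, test, validate


def macro_test(model, validate: pd.DataFrame) -> dict:
    """Compute aggregate metrics on the 10% validation split and check them against requirements."""
    preds = model.predict(validate["text"])
    scores = {
        "macro_f1": f1_score(validate["label"], preds, average="macro"),
        "micro_f1": f1_score(validate["label"], preds, average="micro"),
        "precision": precision_score(validate["label"], preds),
        "recall": recall_score(validate["label"], preds),
    }
    failures = {k: v for k, v in scores.items() if k in REQUIREMENTS and v < REQUIREMENTS[k]}
    assert not failures, f"Macro test failed: {failures}"
    return scores
```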
The spirit of the second is to test new models, or new hyperparameters, on the same 10% holdout validation split. If you are comparing two models at the same point in time, this is straightforward to implement, as shown below. Here, we are training on data from February 2019 to January 2020, testing on the month of data from January 2020 to February 2020, and validating on the month of data from February 2020 to March 2020: a one-month validation split for this data set.
Here we can see that both models are validated on the same validation dataset, allowing a direct comparison of performance.
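A minimal sketch of that setup might look like the following; the “timestamp” column and the train_model factories are hypothetical placeholders, and the point is simply that every candidate is fit on the same training window and scored on the same one-month validation window.

```python
# time_split_comparison.py -- sketch of comparing two models on the same time-based splits.
# The "timestamp", "text", and "label" columns and the train_model factories are
# hypothetical placeholders; dates follow the Feb 2019 - Mar 2020 example above.
import pandas as pd
from sklearn.metrics import f1_score


def time_split(df: pd.DataFrame):
    """Train on Feb 2019 - Jan 2020, test on Jan 2020, validate on Feb 2020."""
    train = df[(df["timestamp"] >= "2019-02-01") & (df["timestamp"] < "2020-01-01")]
    test = df[(df["timestamp"] >= "2020-01-01") & (df["timestamp"] < "2020-02-01")]
    validate = df[(df["timestamp"] >= "2020-02-01") & (df["timestamp"] < "2020-03-01")]
    return train, test, validate


def compare(df: pd.DataFrame, architectures: dict) -> dict:
    """Fit each candidate on the same train split and score it on the same validation split."""
    train, _, validate = time_split(df)
    scores = {}
    for name, train_model in architectures.items():
        model = train_model(train)  # hypothetical factory: takes a dataframe, returns a fitted model
        preds = model.predict(validate["text"])
        scores[name] = f1_score(validate["label"], preds, average="macro")
    return scores
```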
However, as new data rolls in over time, the validation split should update to reflect this; the best way to test the generalizability of your system is to validate on the most recent data. Following the example above, if we are training a new model architecture in April 2020, the validation split shifts to March 2020 through April 2020. Therefore, to keep our Macro tests comparable across models, all models, previously trained and newly trained alike, need to be evaluated on the same March-April 2020 validation split. A model trained on more recent data will likely perform better, but this isn’t guaranteed. In my opinion, this is the best way to test the generalizability of your architecture as new data comes in.
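Continuing the sketch above (again with hypothetical column names and a dict of already-fitted models), a rolling validation split could look like this, with every existing and newly trained model re-scored on the same most recent month of data.

```python
# rolling_validation.py -- sketch of re-validating all models on the most recent month of data.
# Column names and the `all_models` dict of fitted models are hypothetical placeholders.
import pandas as pd
from sklearn.metrics import f1_score


def latest_validation_split(df: pd.DataFrame, months: int = 1) -> pd.DataFrame:
    """Return the most recent `months` of data as the shared validation split."""
    cutoff = df["timestamp"].max() - pd.DateOffset(months=months)
    return df[df["timestamp"] > cutoff]


def revalidate(all_models: dict, df: pd.DataFrame) -> pd.DataFrame:
    """Score every existing and newly trained model on the same, most recent validation split."""
    validate = latest_validation_split(df)
    rows = []
    for name, model in all_models.items():
        preds = model.predict(validate["text"])
        rows.append({"model": name, "macro_f1": f1_score(validate["label"], preds, average="macro")})
    return pd.DataFrame(rows).sort_values("macro_f1", ascending=False)
```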
Testing in machine learning is tricky due to the non-deterministic nature of the approaches used. This is especially true when building systems with people who aren’t as familiar with the details as you are; to them, a prediction with 63% confidence may mean nothing. Therefore, this pattern is used to (1) improve confidence that your model will do what you, and others, think it will do, and (2) create a process for updating models and parameters in an apples-to-apples fashion. I’ve found this framework useful for accomplishing both.
If you're interested, here are some additional resources on testing machine learning systems.
Thanks for reading!