Robust AI: Rethinking Data Strategies
Alastair Muir, PhD, BSc, BEd, MBB
Data Science Consultant | @alastairmuir.bsky.social | Risk Analysis and Optimization
In the world of data, we often find ourselves working with whatever information comes our way. But is that always the best approach?
Traditionally, we have assumed uniform observation errors, and techniques like SMOTE and ADASYN have been our go-to solutions for forcing subsamples to equal sizes. But these are workarounds built on statistical assumptions, and hoping the Central Limit Theorem will protect us is a leap of faith.
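For context, here is a minimal sketch of that conventional rebalancing workflow using imbalanced-learn's SMOTE; the dataset is synthetic and purely illustrative.

```python
# Conventional approach: synthesize minority-class samples so each class
# ends up with an equal subsample size, then train on the rebalanced data.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Illustrative imbalanced dataset (roughly a 95% / 5% class split).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
print("before:", Counter(y))

# SMOTE interpolates new minority-class points until the classes are balanced.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```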
Taking a page from Andrew Ng's data-centric AI playbook, I've adopted a different strategy. I believe in vetting input data, classifying it into subgroups of varying quality, and focusing on the cream of the crop: essentially a data-focused downsampling approach, but one built on a continuous classification of quality. This concept first struck me while working with retinal images for detecting degenerative retinopathy, where some images shine while others fall short.
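As an illustration only (the quality scores, thresholds, and column names below are hypothetical, not my production pipeline), the idea of turning a continuous quality score into subgroups and keeping the top tier looks roughly like this:

```python
import numpy as np
import pandas as pd

# Hypothetical per-image quality scores (e.g. focus, exposure, artifact
# checks for retinal photographs), scaled to [0, 1].
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "image_id": np.arange(500),
    "quality": rng.beta(5, 2, size=500),  # continuous, not a hard pass/fail
})

# Classify observations into quality subgroups from the continuous score...
df["tier"] = pd.cut(
    df["quality"], bins=[0.0, 0.5, 0.8, 1.0], labels=["low", "medium", "high"]
)

# ...then downsample by keeping the cream of the crop for training.
train_ids = df.loc[df["tier"] == "high", "image_id"]
print(df["tier"].value_counts())
print("kept for training:", len(train_ids))
```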
My approach? I examine every observation for reliability. This lets me account for errors in the input variables and apply weighting schemes that integrate with most machine learning algorithms.
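Here is a minimal sketch of how such reliability weights could plug into a standard estimator, assuming the per-observation scores have already been computed (the scoring itself is the hypothetical part); most scikit-learn models accept them through sample_weight.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative features and labels; in practice these are the vetted observations.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hypothetical per-observation reliability scores in [0, 1], e.g. derived
# from image sharpness, sensor calibration records, or label agreement.
rng = np.random.default_rng(0)
reliability = rng.uniform(0.2, 1.0, size=len(y))

# Reliable observations influence the fit more, without discarding the rest.
model = GradientBoostingClassifier(random_state=0)
model.fit(X, y, sample_weight=reliability)
```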
“The code is more what you’d call guidelines than actual rules”
Co-Founder & Chief Scientist @ Cleanlab | CS PhD from MIT
Nice article! You might be interested in a tool I built that automatically provides such categorizations of your data (works for image, text, and tabular datasets): https://cleanlab.ai/blog/automated-data-quality-at-scale/