Robust AI: Rethinking Data Strategies
Derek Mellott


In the world of data, we often find ourselves working with whatever information comes our way. But is that always the best approach?

Traditionally, we have assumed uniform observation errors, and techniques like SMOTE and ADASYN have been our go-to solutions for producing equal subsample sizes across classes. But these are workarounds built on statistical assumptions, and hoping the Central Limit Theorem will protect us is a leap of faith.
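
For context, here is a minimal sketch of that conventional rebalancing route, using imbalanced-learn's SMOTE on a synthetic dataset (the dataset and parameters are illustrative only):

```python
# Minimal sketch: rebalancing a skewed dataset with SMOTE.
# Assumes scikit-learn and imbalanced-learn are installed; the
# dataset below is synthetic and purely illustrative.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a deliberately imbalanced two-class problem (roughly 9:1).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
print("Before:", Counter(y))

# SMOTE synthesizes new minority-class points by interpolating
# between existing minority-class neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After: ", Counter(y_res))
```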

Taking a page from Andrew Ng's data-centric AI playbook, I've adopted a different strategy. I believe in vetting input data, classifying it into subgroups of varying quality, and focusing on the cream of the crop – essentially a data-focused downsampling approach, but with a continuous classification. This concept first struck me while working with retinal images for detecting degenerative retinopathy, where some images shine while others fall short.
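
As a rough illustration of that idea, here is a minimal sketch of quality-based downsampling, assuming each observation already carries a continuous quality score from some upstream assessment (the column names, scores, and threshold are hypothetical):

```python
# Minimal sketch of continuous-quality downsampling. The quality
# scores would come from your own reliability assessment (e.g. an
# image-sharpness or signal-to-noise metric); values are illustrative.
import pandas as pd

df = pd.DataFrame({
    "feature": [0.2, 1.4, 0.7, 2.1, 0.9],
    "quality": [0.95, 0.40, 0.88, 0.15, 0.72],  # continuous reliability score in [0, 1]
})

# Keep only the most reliable observations for training.
QUALITY_THRESHOLD = 0.7  # hypothetical cutoff, tuned per dataset
train_df = df[df["quality"] >= QUALITY_THRESHOLD]
print(train_df)
```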

My approach? I examine every observation for reliability. This lets me account for errors in the input variables and apply weighting schemes that integrate with most machine learning algorithms.

  • Philosophically, we’re emphasizing the most reliable data, free from gross errors.
  • Mathematically, we’re adjusting the weight of observations based on their reliability.
  • Statistically, we’re setting weights proportional to the inverse variance of each observation.
  • Algorithmically, we’re specifying weight factors for observations (see the sketch after this list).
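
Here is the sketch referenced above: a minimal example of inverse-variance weighting passed to a standard scikit-learn estimator via sample_weight. The per-observation variance estimates are assumed to come from your own reliability assessment; the numbers are illustrative.

```python
# Minimal sketch: weight observations by inverse variance and pass
# those weights to an ordinary estimator. sigma2 holds estimated
# per-observation error variances (illustrative values only).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.3])
sigma2 = np.array([0.01, 0.04, 0.01, 0.25, 0.04])  # estimated error variance per point

# Weights proportional to the inverse variance: noisier measurements
# pull on the fit less.
weights = 1.0 / sigma2

model = LinearRegression()
model.fit(X, y, sample_weight=weights)  # most sklearn estimators accept sample_weight
print(model.coef_, model.intercept_)
```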

“The code is more what you’d call guidelines than actual rules”

Jonas Mueller

Co-Founder & Chief Scientist @ Cleanlab | CS PhD from MIT

1y

nice article! You might be interested in a tool I built that automatically provides such categorizations of your data (works for image, text, tabular datasets): https://cleanlab.ai/blog/automated-data-quality-at-scale/
