Robust AI: Rethinking Data Strategies
Alastair Muir, PhD, BSc, BEd, MBB
Data Science Consultant | @alastairmuir.bsky.social | Risk Analysis and Optimization
In the world of data, we often find ourselves working with whatever information comes our way. But is that always the best approach?
Traditionally, we have assumed uniform observation errors, and techniques like SMOTE and ADASYN have been our go-to solutions for forcing subsamples to equal sizes. But these are workarounds built on statistical assumptions, and hoping the Central Limit Theorem will protect us is a leap of faith.
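For context, here is a minimal sketch of that conventional rebalancing workflow using imbalanced-learn's SMOTE; the dataset is synthetic and purely illustrative.

```python
# Conventional approach: synthesize minority-class samples so each class
# ends up with an equal subsample size, then train on the rebalanced data.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Illustrative imbalanced dataset (roughly a 95% / 5% class split).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
print("before:", Counter(y))

# SMOTE interpolates new minority-class points until the classes are balanced.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```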
Taking a page from Andrew Ng's data-centric AI playbook, I've adopted a different strategy. I believe in vetting input data, classifying it into subgroups of varying quality, and focusing on the cream of the crop: essentially a data-focused downsampling approach, but one built on a continuous classification of quality. This concept first struck me while working with retinal images for detecting degenerative retinopathy, where some images shine while others fall short.
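As an illustration only (the quality scores, thresholds, and column names below are hypothetical, not my production pipeline), the idea of turning a continuous quality score into subgroups and keeping the top tier looks roughly like this:

```python
import numpy as np
import pandas as pd

# Hypothetical per-image quality scores (e.g. focus, exposure, artifact
# checks for retinal photographs), scaled to [0, 1].
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "image_id": np.arange(500),
    "quality": rng.beta(5, 2, size=500),  # continuous, not a hard pass/fail
})

# Classify observations into quality subgroups from the continuous score...
df["tier"] = pd.cut(
    df["quality"], bins=[0.0, 0.5, 0.8, 1.0], labels=["low", "medium", "high"]
)

# ...then downsample by keeping the cream of the crop for training.
train_ids = df.loc[df["tier"] == "high", "image_id"]
print(df["tier"].value_counts())
print("kept for training:", len(train_ids))
```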
My approach? I examine every observation for reliability. This lets me account for errors in the input variables and apply weighting schemes that integrate with most machine learning algorithms.
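Here is a minimal sketch of how such reliability weights could plug into a standard estimator, assuming the per-observation scores have already been computed (the scoring itself is the hypothetical part); most scikit-learn models accept them through sample_weight.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative features and labels; in practice these are the vetted observations.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hypothetical per-observation reliability scores in [0, 1], e.g. derived
# from image sharpness, sensor calibration records, or label agreement.
rng = np.random.default_rng(0)
reliability = rng.uniform(0.2, 1.0, size=len(y))

# Reliable observations influence the fit more, without discarding the rest.
model = GradientBoostingClassifier(random_state=0)
model.fit(X, y, sample_weight=reliability)
```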
“The code is more what you’d call guidelines than actual rules”
Co-Founder & Chief Scientist @ Cleanlab | CS PhD from MIT
Nice article! You might be interested in a tool I built that automatically provides such categorizations of your data (works for image, text, and tabular datasets): https://cleanlab.ai/blog/automated-data-quality-at-scale/