The Data is the Algorithm (2/5)
Gary M. Shiffman
Economist · 2x Artificial Intelligence company co-founder · Writer
Look again at the chihuahua-muffin images from Article 1; can you identify the chihuahuas? Easy enough. But what if you had to identify all the chihuahuas out of an array of 200,000 or 20 million images? Assume for a moment that your employer has an important reason for this task. For the purposes of this thought experiment, assume “chihuahua” represents a searched-for behavior, such as human trafficking, risky correspondents, or sanctions violations.
For a human, this search across an entire customer population would be possible but not feasible. Nobody would want the job, and error rates would be high. In the real world of financial crimes compliance, banks have historically avoided these population-wide searches and relied upon rules-based alerts instead. Enter Machine Learning (ML), and the overwhelming, seemingly impossible task becomes feasible.
Machine Learning is learning by example (“inductive”). People building ML algorithms need examples. Want to identify chihuahua images? Feed an algorithm many examples of chihuahuas. More specifically, task humans with collecting a large set of chihuahua images and labeling them as such, then collect a large sample of “not-chihuahua” images and label them as such. That’s it. The data is the algorithm. If your sample is large and properly labeled, you’ve got a good algorithm. The very best algorithms are those trained on the “most best data” – the training data.
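To make “learning by example” concrete, here is a minimal sketch of the idea in Python. Everything in it is an illustrative assumption, not a description of any real system: scikit-learn as the library, a logistic-regression classifier as the model, and made-up numeric feature vectors standing in for chihuahua and not-chihuahua images.

```python
# A minimal sketch of "learning by example" (supervised learning).
# The library, model, and synthetic data are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend each "image" is a small feature vector.
# Class 1 = chihuahua, class 0 = not-chihuahua (muffins, pugs, everything else).
chihuahuas = rng.normal(loc=1.0, scale=1.0, size=(500, 16))
not_chihuahuas = rng.normal(loc=-1.0, scale=1.0, size=(500, 16))

X = np.vstack([chihuahuas, not_chihuahuas])
y = np.array([1] * 500 + [0] * 500)  # the human-supplied labels

# The "algorithm" is whatever pattern the model extracts from these labeled examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```

Notice that nobody writes a rule describing what a chihuahua looks like; the labeled examples carry all of that information, which is the sense in which the data is the algorithm.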
Why muffins? Because muffins challenge the algorithm. Muffins represent the “not-chihuahuas”, and the food-dog meme entertains because of the similarities. The training data of chihuahua images creates the chihuahua discovery tool – an algorithm able to distinguish chihuahua images from across the internet index.
What could go wrong? Not enough training data, or bad training data. The algorithm is the data, so if you only have 10 chihuahua images, your algorithm will likely miss most target images across a large population of possible chihuahuas. And in a large training sample, if another breed such as “pug” is improperly labeled as “chihuahua”, then any algorithm trained on that set of pug and chihuahua images will learn the error. Unlike a child, an algorithm has no consciousness; teach the computer that “chihuahua” means pug or chihuahua, and it will dutifully identify both. In this instance, the algorithm has picked up a “pug” bias.
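The sketch below illustrates the “pug bias” under the same illustrative assumptions as before (scikit-learn, synthetic feature vectors in place of images): mislabel a chunk of the training set and the model faithfully learns the mistake.

```python
# A hedged illustration of label noise: call many pugs "chihuahua" in the
# training labels and the trained model inherits that bias.
# Synthetic data and scikit-learn are assumptions for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_data(n=500):
    chihuahuas = rng.normal(1.0, 1.0, (n, 16))
    pugs = rng.normal(-1.0, 1.0, (n, 16))   # a different "breed"
    X = np.vstack([chihuahuas, pugs])
    y = np.array([1] * n + [0] * n)         # correct labels
    return X, y

X_train, y_train = make_data()
X_test, y_test = make_data()

# Corrupt the training labels: mark 40% of the pugs as "chihuahua".
noisy = y_train.copy()
pug_idx = np.where(y_train == 0)[0]
flipped = rng.choice(pug_idx, size=int(0.4 * len(pug_idx)), replace=False)
noisy[flipped] = 1

clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
noisy_model = LogisticRegression(max_iter=1000).fit(X_train, noisy)

print(f"trained on clean labels: {clean_model.score(X_test, y_test):.2f}")
print(f"trained on 'pug' labels: {noisy_model.score(X_test, y_test):.2f}")
```

The model trained on the mislabeled set scores worse on correctly labeled test data, because it was taught, by example, that pugs are chihuahuas.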
Consider the consequences of these training data errors in high-consequence fields. A chihuahua mistaken for a muffin matters far more when a crime-fighting team uses an algorithm that identifies a “not-drug-trafficker” as a likely “drug trafficker.”
If you work in the financial crimes and compliance world, then to identify a human trafficker or an elder fraud scammer, you need to build an algorithm using the “most best” training data – properly labeled examples of known criminals. The more known criminal data available for training, the better the algorithm, because the data is the algorithm.
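A rough sketch of “more training data, better algorithm”: train the same model on growing slices of labeled examples and watch performance on held-out data climb. As before, the data is synthetic and the setup is an assumption for illustration, not a description of any real compliance system.

```python
# Illustrative learning curve: held-out accuracy generally improves as the
# number of properly labeled training examples grows. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def make_data(n):
    positives = rng.normal(0.25, 1.0, (n, 16))    # stand-ins for known criminals
    negatives = rng.normal(-0.25, 1.0, (n, 16))   # everyone else
    X = np.vstack([positives, negatives])
    y = np.array([1] * n + [0] * n)
    return X, y

X_test, y_test = make_data(2000)

for n in (10, 50, 250, 1000, 5000):
    X_train, y_train = make_data(n)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"{2 * n:>6} labeled examples -> held-out accuracy {model.score(X_test, y_test):.2f}")
```

Measuring that improvement requires data the model has never seen, which is exactly the role of the testing data discussed next.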
What does “better” mean, and can this be measured? In the next article, I will discuss the importance of testing data in building and evaluating AI/ML algorithms.
This is the second article of a 5-part series. See my video short on this topic, read Article 1 here, or move on to Article 3.