The Data is the Algorithm (2/5)
Gary M. Shiffman
Economist · 2x Artificial Intelligence company co-founder · Writer
Look again at the chihuahua-muffin images from Article 1; can you identify the chihuahuas? Easy enough. But what if you had to identify all the chihuahuas out of an array of 200,000 or 20 million images? Assume for a moment that your employer has an important reason for this task. For the purposes of this thought experiment, assume “chihuahua” represents a searched-for behavior, such as human trafficking, risky correspondents, or sanctions violations.
For a human, this search across an entire customer population would be possible but not feasible. Nobody would want the job, and error rates would be high. In the real world of financial crimes compliance, banks have historically avoided these population-wide searches and relied upon rules-based alerts instead. Enter Machine Learning (ML), and the overwhelming, seemingly impossible task becomes feasible.
Machine Learning is learning by example (“inductive”). People building ML algorithms need examples. Want to identify chihuahua images? Feed an algorithm many examples of chihuahuas. More specifically, task humans with collecting a large set of chihuahua images and labeling them as such, then collect a large sample of “not-chihuahua” images and label them as such. That’s it. The data is the algorithm. If your sample is large and properly labeled, you’ve got a good algorithm. The very best algorithms are those trained on the “most best data” – the training data.
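To make “learning by example” concrete, here is a minimal sketch of the idea in Python. Everything in it is an illustrative assumption, not a description of any real system: scikit-learn as the library, a logistic-regression classifier as the model, and made-up numeric feature vectors standing in for chihuahua and not-chihuahua images.

```python
# A minimal sketch of "learning by example" (supervised learning).
# The library, model, and synthetic data are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend each "image" is a small feature vector.
# Class 1 = chihuahua, class 0 = not-chihuahua (muffins, pugs, everything else).
chihuahuas = rng.normal(loc=1.0, scale=1.0, size=(500, 16))
not_chihuahuas = rng.normal(loc=-1.0, scale=1.0, size=(500, 16))

X = np.vstack([chihuahuas, not_chihuahuas])
y = np.array([1] * 500 + [0] * 500)  # the human-supplied labels

# The "algorithm" is whatever pattern the model extracts from these labeled examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```

Notice that nobody writes a rule describing what a chihuahua looks like; the labeled examples carry all of that information, which is the sense in which the data is the algorithm.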
Why muffins? Because muffins challenge the algorithm. Muffins represent the “not-chihuahuas”, and the food-dog meme entertains because of the similarities. The training data of chihuahua images creates the chihuahua discovery tool – an algorithm able to distinguish chihuahua images from across the internet index.
What could go wrong? Not enough training data, or bad training data. The algorithm is the data, so if you only have 10 chihuahua images, your algorithm will likely miss most target images across a large population of possible chihuahuas. And in a large training sample, if another breed such as “pug” is improperly labeled as “chihuahua”, then any algorithm trained on that set of pug and chihuahua images will learn the error. Unlike a child, an algorithm has no consciousness; teach the computer that “chihuahua” means pug or chihuahua, and it will dutifully identify both. In this instance, the algorithm has picked up a “pug” bias.
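The sketch below illustrates the “pug bias” under the same illustrative assumptions as before (scikit-learn, synthetic feature vectors in place of images): mislabel a chunk of the training set and the model faithfully learns the mistake.

```python
# A hedged illustration of label noise: call many pugs "chihuahua" in the
# training labels and the trained model inherits that bias.
# Synthetic data and scikit-learn are assumptions for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_data(n=500):
    chihuahuas = rng.normal(1.0, 1.0, (n, 16))
    pugs = rng.normal(-1.0, 1.0, (n, 16))   # a different "breed"
    X = np.vstack([chihuahuas, pugs])
    y = np.array([1] * n + [0] * n)         # correct labels
    return X, y

X_train, y_train = make_data()
X_test, y_test = make_data()

# Corrupt the training labels: mark 40% of the pugs as "chihuahua".
noisy = y_train.copy()
pug_idx = np.where(y_train == 0)[0]
flipped = rng.choice(pug_idx, size=int(0.4 * len(pug_idx)), replace=False)
noisy[flipped] = 1

clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
noisy_model = LogisticRegression(max_iter=1000).fit(X_train, noisy)

print(f"trained on clean labels: {clean_model.score(X_test, y_test):.2f}")
print(f"trained on 'pug' labels: {noisy_model.score(X_test, y_test):.2f}")
```

The model trained on the mislabeled set scores worse on correctly labeled test data, because it was taught, by example, that pugs are chihuahuas.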
Consider the consequences of these training data errors in high-consequence fields. A chihuahua mistaken for a muffin matters far more when a crime-fighting team uses an algorithm that identifies a “not-drug-trafficker” as a likely “drug trafficker.”
If you work in the financial crimes and compliance world, then to identify a human trafficker or an elder fraud scammer, you need to build an algorithm using the “most best” training data – properly labeled examples of known criminals. The more known criminal data available for training, the better the algorithm, because the data is the algorithm.
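A rough sketch of “more training data, better algorithm”: train the same model on growing slices of labeled examples and watch performance on held-out data climb. As before, the data is synthetic and the setup is an assumption for illustration, not a description of any real compliance system.

```python
# Illustrative learning curve: held-out accuracy generally improves as the
# number of properly labeled training examples grows. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def make_data(n):
    positives = rng.normal(0.25, 1.0, (n, 16))    # stand-ins for known criminals
    negatives = rng.normal(-0.25, 1.0, (n, 16))   # everyone else
    X = np.vstack([positives, negatives])
    y = np.array([1] * n + [0] * n)
    return X, y

X_test, y_test = make_data(2000)

for n in (10, 50, 250, 1000, 5000):
    X_train, y_train = make_data(n)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"{2 * n:>6} labeled examples -> held-out accuracy {model.score(X_test, y_test):.2f}")
```

Measuring that improvement requires data the model has never seen, which is exactly the role of the testing data discussed next.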
What does “better” mean, and can this be measured? In the next article, I will discuss the importance of testing data in building and evaluating AI/ML algorithms.
This is the second article of a 5-part series. See my video short on this topic, read Article 1 here, or move on to Article 3.