Testing Data and Drawing the Threshold (3/5)
Gary M. Shiffman
Economist · 2x Artificial Intelligence company co-founder · Writer
In my previous articles, I introduced Machine Learning (ML), training data, and the sources of accuracy and bias, and I made assertions about building “better” algorithms. Now, let’s unpack “better” and how to measure algorithmic performance.
Remember, in this series, “chihuahua” can stand in for anything you seek to discover. You created a large sample of properly labeled data, the training data, and fed that to an algorithm, creating a chihuahua algorithm.
The output of any ML algorithm is a distribution. Along the x- or horizontal axis, you have a measure of chihuahua-ness, sometimes referred to as the algorithm’s confidence in “predicting” that an entity is a chihuahua. Along the y- or vertical axis, you have the count of entities.
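For readers who think in code, here is a minimal sketch of such a distribution in Python; the scores are simulated stand-ins, not the output of any real chihuahua model:

```python
import numpy as np

# Hypothetical confidence scores (0-10) produced by a trained model for
# each entity in a test set; in practice these come from the algorithm.
rng = np.random.default_rng(seed=42)
scores = np.clip(rng.normal(loc=4.0, scale=2.5, size=500), 0, 10)

# Count how many entities fall into each unit-wide score bin:
# the x-axis is "chihuahua-ness," the y-axis is the entity count.
counts, bin_edges = np.histogram(scores, bins=10, range=(0, 10))
for lo, hi, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"score {lo:>4.1f}-{hi:>4.1f}: {'#' * (n // 5)} ({n})")
```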
Once the algorithm creates the distribution, the human must perform the single most important task: draw the threshold. In my decade-plus of working with ML systems, this is perhaps the most misrepresented aspect of the art of deploying AI/ML technologies into high-consequence operational environments.
Machines have no conscious awareness of right and wrong; humans must supply it. How many images should be treated as “alerts” and sent for human review? A data scientist might say that the algorithm “predicted” which entities are of interest to the operator. But the prediction requires a threshold, and a threshold depends upon particular risk profiles and risk preferences. The machine only creates the distribution, using training data provided by humans. The human makes the next move of drawing a threshold.
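To see why the threshold is a risk decision rather than a modeling decision, note that each candidate threshold implies a different alert volume for the human review team. A small illustrative sketch, again with made-up scores:

```python
import numpy as np

# Hypothetical scores, as in the previous sketch.
rng = np.random.default_rng(seed=42)
scores = np.clip(rng.normal(loc=4.0, scale=2.5, size=500), 0, 10)

# Each candidate threshold implies a different alert workload:
# a risk-averse operator tolerates more alerts (lower threshold);
# a resource-constrained one tolerates fewer (higher threshold).
for threshold in (5, 6, 7, 8, 9):
    alerts = int((scores >= threshold).sum())
    print(f"threshold {threshold}: {alerts} entities sent for human review")
```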
In a small population, a person can easily distinguish the chihuahuas from the not-chihuahuas. But to find the sought-after pattern across a large data set, the ML goes back to work, using the labeled data to create test data.
In the image here, showing 500 entities of test data in a post-algorithm distribution, only the labeled chihuahua images appear in color for the purposes of this article; the computer can “see” the label. The human-drawn threshold tells the system to treat scores of eight and above as-if chihuahua, and scores of seven and below as-if not-chihuahua. Now we can measure performance.
First, we count True Positives, False Positives, True Negatives, and False Negatives (a counting sketch in code follows the list below).
Above (right of) the threshold = Predicted Positive
Chihuahuas above the threshold = True Positive
Not-chihuahuas above the threshold = False Positive
Below (left of) the threshold = Predicted Negative
Chihuahuas below the threshold = False Negative
Not-Chihuahuas below the threshold = True Negative
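Here is a minimal sketch of that counting in Python, with hypothetical labels and scores standing in for real test data:

```python
import numpy as np

# Hypothetical test data: a score (0-10) and a human-provided label
# for each of 500 entities. True means "actually a chihuahua."
rng = np.random.default_rng(seed=42)
labels = rng.random(500) < 0.1  # ~10% of entities are chihuahuas
scores = np.where(labels,
                  np.clip(rng.normal(7.5, 1.5, 500), 0, 10),
                  np.clip(rng.normal(3.5, 1.5, 500), 0, 10))

THRESHOLD = 8  # the human-drawn line: 8 and above = predicted positive
predicted_positive = scores >= THRESHOLD

true_positives = int((predicted_positive & labels).sum())
false_positives = int((predicted_positive & ~labels).sum())
false_negatives = int((~predicted_positive & labels).sum())
true_negatives = int((~predicted_positive & ~labels).sum())

print(f"TP={true_positives} FP={false_positives} "
      f"FN={false_negatives} TN={true_negatives}")
```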
Looking at the image, every chihuahua above the threshold is called a true positive – the algorithm-human team got it right. Everything above the threshold that isn’t a chihuahua is a false positive.
Similarly, “not-chihuahuas” below the threshold are true negatives – a win for team algorithm-human. All chihuahuas below the threshold are false negatives – human traffickers and money launderers that evaded us, again. Counting and some simple math get us to the measurements of accuracy: effectiveness and efficiency. In the next article, I will dive into accuracy in more detail.
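The “simple math” is easy to sketch. One common reading, and the assumption of this illustration rather than the author’s stated definitions, maps effectiveness to recall (the share of actual chihuahuas caught) and efficiency to precision (the share of alerts that were real):

```python
# Two common accuracy measures computed from the four counts.
# (Mapping "effectiveness" to recall and "efficiency" to precision is
# this sketch's assumption; the next article defines the terms.)
def recall(tp: int, fn: int) -> float:
    """Effectiveness: share of actual chihuahuas the system caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp: int, fp: int) -> float:
    """Efficiency: share of alerts that were actual chihuahuas."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Example with illustrative counts from a hypothetical run:
print(f"effectiveness (recall):  {recall(40, 10):.2f}")    # 0.80
print(f"efficiency (precision):  {precision(40, 15):.2f}")  # 0.73
```

Moving the threshold trades one measure against the other: a lower threshold catches more chihuahuas but floods reviewers with false positives, while a higher threshold does the reverse.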
This is the third article of a 5-part series. See my video short on this topic or read Article 2 here.