Superhuman AI, or not? What does "human level" performance mean?

You can't turn left or right without seeing a claim about an AI achieving human level, or superhuman, performance. What does it all mean? I've dug deep into the topic to try to make sense of it. As usual, all sources are included at the end.

What are all these claims about AI with human performance?

The claims themselves are often less interesting than what these performance tests of AI and humans really tell us. First let's see what they are all asserting in papers and announcements.

There are three types of claims about "human" or "beyond human" AI performance that I see repeatedly:

  • pattern recognition and labeling
  • reasoning reminiscent of humans
  • emergent capabilities

Pattern recognition and labeling

From my research and my time on social media (which is considerable, for better or for worse), I have noted that the most common claims are about pattern recognition and labeling problems. It is this one, the first of the three, that I will focus on in this article. I'll do separate deep dives on the other two because they are very exciting and complex and they deserve close attention.

Common pattern-to-label tasks include transcribing speech, labeling natural language text with its meaning, or labeling an image with the main item visible in it. In other words, the input is a pattern, e.g. the 2D pixel array of an image, a string of words in text form, or an audio file containing a digitally encoded recording of human speech. The output is a label, or several labels.

So, as an example, if the pattern presented is an audio file of Obi-Wan Kenobi speaking, the output labels would be something like “Strike me down and I will become more powerful than you could possibly imagine.”

(Did I slip in the Star Wars reference naturally?)
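
To make the shape of a pattern-to-label task concrete, here is a minimal sketch of an image-to-label call in Python using an off-the-shelf pretrained classifier. The library and model name are my own illustrative choices, not something taken from the benchmarks discussed below.

```python
# A minimal image-to-label sketch: the input is a pattern (a 2D pixel array),
# the output is one or more labels with confidence scores.
# Assumes `pip install transformers pillow torch`; the model name is illustrative.
from transformers import pipeline

classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# Any local photo will do; the path is a placeholder.
predictions = classifier("my_photo.jpg", top_k=3)
for p in predictions:
    print(f"{p['label']:30s}  {p['score']:.2f}")
```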

Image-to-label tasks

My approach, whenever I see a claim about human level performance, is to actually try the pattern-to-label task myself to get a feel for what is really involved. I sometimes post these task challenges on social media to give others a quick way to try it out. For example, here are five images I posted recently on LinkedIn, asking people to give one label for the main item in each image.

[Image: the five photos from the LinkedIn labeling challenge]

If you try to label these yourself, you will quickly see some of the problems involved. First, it is sometimes hard to tell what the main object even is: is it the reporter or the van in the top-right image?

Even more revealing is to compare the labelings of a bunch of people. So, for image 4, I got the following labels from just the first 5 respondents:


"Asian food", "Korean meal", "Liver", "Asian Restaurant (or maybe Liver-Feast?)", "liver hot pot". Imagine the divergence of labels that human annotators will produce on 1500 test images in this data set!

So one quickly sees that humans actually have to learn a set of sometimes restrictive and unnatural labels just to be able to take part in measuring the human performance benchmark. In fact Andrej Karpathy, who later became well known for his work at OpenAI and Tesla, notes that he invested many hours in 2014 learning how to apply the labels to the ImageNet data set.

Karpathy said: "It was hard. ... I only enjoyed the first ~200, and the rest I only did #forscience."

However, this work reveals both how good the machines are and, at the same time, how the limitations of the AI differ from the limitations of humans. Karpathy concluded the following:

"It is clear that humans will soon only be able to outperform state of the art image classification models by use of significant effort, expertise, and time."

In 2014, Andrej Karpathy got his error rate down to 5.1%, while the AI model, GoogLeNet, had an error rate of 6.8%. So humans were beating AI at that time, but Karpathy could already see the writing on the wall. Indeed, by 2020 there were numerous results, replicated by different teams, of AI achieving error rates of between 1.3% and 2% on this same image labeling task.
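
For context, these ImageNet figures are top-5 error rates: a prediction only counts as wrong if the true label is missing from the model's five most confident guesses. Here is a minimal sketch of that metric, with made-up predictions purely for illustration:

```python
# Top-5 error: an example is wrong only if the true label is missing from
# the model's five highest-ranked guesses. Data below is made up for illustration.
predictions = [
    (["tabby cat", "tiger cat", "lynx", "fox", "dog"], "tabby cat"),     # correct (rank 1)
    (["bottle", "cup", "face powder", "jar", "vase"], "face powder"),    # correct (rank 3)
    (["sports car", "convertible", "jeep", "truck", "van"], "minivan"),  # wrong (not in top 5)
]

errors = sum(true_label not in top5 for top5, true_label in predictions)
print(f"Top-5 error rate: {errors / len(predictions):.1%}")  # 33.3%
```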

But here's the thing - and it's the message that I'm constantly telling everyone who will listen - it's perfectly possible for the following two statements to be true simultaneously:

  • AI can beat humans at pattern-to-label tasks on a specific success metric
  • The AI makes the most stupid mistakes imaginable

Karpathy very helpfully points out in his summary that the AI has difficulty with the following occurrences in images: "...closeups of parts of an object, unconventional viewpoints such as a rotated image, images that can significantly benefit from the ability to read text (e.g. a featureless container identifying itself as “face powder”), objects with heavy occlusions, and images that depict a collage of multiple images."

He lists some more areas where AI has problems:

"[The AI] struggles with images that depict objects of interest in an abstract form, such as 3D-rendered images, paintings, sketches, plush toys, or statues."

You might think that AI has got better at these things since 2014. But the very paper that is often cited by those making big claims about AI outperforming humans (D. Kiela et al in 2020) clearly states that:

"Models that achieve super-human performance on benchmark tasks (according to the narrow criteria used to define human performance) nonetheless fail on simple challenge examples and falter in real-world scenarios."
[Chart: AI performance over time relative to the human baseline on several benchmarks, from Kiela et al.]

This chart from that paper is one of the most reproduced on the internet. It usually comes with a headline like "AI performing better than humans on a wide variety of tasks". The paper itself is actually a lot more nuanced.

(See Sources and Notes at the end for details.)


As you might know, I am an AI enthusiast. Deploying AI-enabled conversational systems is what I do. I just think that we AI technologists should be very honest with each other, with the people we advise, and with the public in general. So I'll say it again:

"Deploying any AI brings trade-offs along with it. Often the benefits far outweigh the drawbacks, but the trade-offs have to be dealt with. In particular, AI will sometimes be super dumb!"


Speech transcription tasks

This is a field close to my heart. My first job in machine learning and AI was focused on speech transcription and natural language understanding. So I have followed the development of this technology closely for nearly a quarter of a century.

So how good is it at the moment?

The state-of-the-art performance on benchmarks is very impressive. Word error rates on audiobook recordings have dropped from around 6% in 2016 to between 1.4% and 2% in 2022.

That's pretty amazing. If you think about the challenge for a human of transcribing audio without replaying parts, then you will readily understand that it is hard for a person to get down to 1 or 2 words wrong in every 100.
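
Word error rate is the metric behind all of these figures: the minimum number of word substitutions, deletions and insertions needed to turn the hypothesis transcript into the reference, divided by the number of reference words. Here is a minimal sketch of the calculation (my own implementation, not any particular benchmark's scoring script):

```python
# Word error rate via edit distance over words:
# WER = (substitutions + deletions + insertions) / number of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

ref = "strike me down and i will become more powerful than you could possibly imagine"
hyp = "strike me down and i will become more powerful than you can possibly imagine"
print(f"WER: {word_error_rate(ref, hyp):.1%}")  # one substitution in 14 words, about 7.1%
```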

On the other hand, you can find well-documented research showing considerably higher word error rates on other transcription tasks. For example, it has recently been demonstrated that word error rates on German-language oral histories (i.e. recordings of people talking about their experiences and the historical events they witnessed) are considerably worse.

Error rates on oral histories are between about 15% and 23% depending on the audio quality. Note that the 15% is the word error rate even on very high quality audio.

Hidden from view in these results - where the metric is word error rate - are the outputs from speech transcription that are totally unusable or that any human would find very peculiar.

When I speak at events, I sometimes do interactive analysis with the audience of exactly how good machines are compared to humans on speech input. What you see from such exercises is that the machines are phenomenally good at writing out speech in words when we look at some specific performance dimensions.

For example, speed and consistency are far better than what humans can achieve. Also, specialized models are sometimes able to do a good job transcribing speech that humans have difficulty getting right. But what we also usually find out together is that the machines are terrible at things that are reasonably easy for humans.

For example, AI is pretty lousy at capturing email addresses or full house, street, town and postcode addresses. The AI is usually also bad at any novel or out-of-vocabulary words.

But as we saw with image labeling above, the incredibly good results that can be achieved mean that there are lots of situations where it is now a no-brainer for companies to use speech processing.

"For most business applications, in practice, current performance of speech transcription is more than adequate. Indeed it can bring huge benefits, and the trade-offs are relatively easy to manage. But careful design and dialog management decisions are needed to get the desired outcomes."

In general, for most pattern labeling tasks, I would go so far as to say that it is essential for businesses to invest in AI. We are living in an exciting time, when the current performance of commercially available AI solutions is really good. It also means that competitive advantage and cost efficiency are both likely to suffer at businesses that still handle a lot of pattern-to-label processes without AI.

I would just caution that all such uses of AI come with consequences that need to be managed. Getting the benefits means planning from the start to have the design and technical discipline, and the procedures, to deal with the inevitable trade-offs. As in every previous innovation cycle, the test of success is how well you integrate these technologies into end-to-end business processes.

Sources & Notes

To give full context to anyone wanting to dig deeper into this topic, here are my sources and notes for each part of the article. Please let me know if you spot any error or omission. I'll be happy to address it.

On the cited references for pattern recognition and labeling

One of the cited sources for the performance of AI over time on benchmarks where human performance data is also available is the paper by D. Kiela et al called "Rethinking Benchmarking in NLP". Here is the graph from that source. As you can see, the different "success rates" are normalized so that -1 is the starting point at the first measurement in the sample of results and 0 (zero) is the human performance level available for each dataset.

[Chart: benchmark performance over time, normalized so the first measurement is -1 and human performance is 0 (Kiela et al.)]
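
The passage quoted here does not spell out the exact formula, but a rescaling consistent with that description maps the first measured score to -1 and the human baseline to 0. A small sketch of my reading of it:

```python
def normalize_score(score: float, first_score: float, human_score: float) -> float:
    """Rescale so the first measurement maps to -1 and human performance maps to 0.
    Values above 0 then mean 'better than the human baseline on this benchmark'."""
    return (score - human_score) / (human_score - first_score)

# Illustrative numbers only (accuracy on some hypothetical benchmark):
first, human = 0.70, 0.90
for model_score in (0.70, 0.80, 0.90, 0.95):
    print(model_score, "->", round(normalize_score(model_score, first, human), 2))
# 0.70 -> -1.0, 0.80 -> -0.5, 0.90 -> 0.0, 0.95 -> 0.25
```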


Here's the original paper:


And here is the version reproduced in a Sequoia post. Variants of this image have been reproduced in many publications and presentations.

[Chart: the same benchmark performance chart as reproduced in the Sequoia post]

First of all, yes it is the case that on existing benchmarks for handwriting recognition, speech recognition, image labeling, reading comprehension and language understanding, AIs are achieving results at or in excess of humans on the same task. But as Kiela et al note explicitly in the paper, "These are remarkable achievements, but there is an extensive body of evidence indicating that these models do not in fact have the human level natural language capabilities one might be lead to believe." They go on to state "Models that achieve super-human performance on benchmark tasks (according to the narrow criteria used to define human performance) nonetheless fail on simple challenge examples and falter in real-world scenarios."

So the very paper so often cited as evidence of superhuman performance points out that this is not necessarily the case. Indeed the purpose of the paper is to indicate that the benchmarking approach needs to be improved.

Below is the Sequoia post that uses the performance benchmark chart above. I take a more nuanced view of what they say about performance: "Sure enough, as the models get bigger and bigger, they begin to deliver human-level, and then superhuman results." However, I completely agree with their main argument:

"With the platform layer solidifying, models continuing to get better/faster/cheaper, and model access trending to free and open source, the application layer is ripe for an explosion of creativity."


On the performance of AI vs humans on speech transcription tasks

Here is a good overview of the performance of different AI models on a single well-known audio dataset between 2016 and 2022. As you can see, the word error rate has dropped to under 2%.

[Chart: word error rate on the audiobook benchmark, 2016 to 2022]

You can find that and the list of the relevant publications here:

Despite these impressive results, we must also take into account well-documented research that reveals word error rates much worse than humans' on other data sets or under different conditions. For example, there are rather large data corpora of oral history recordings, in which people who witnessed historical events recount their stories and experiences. M. Gref et al reported in their paper from January 2022 that

"We estimate a human word error rate of 8.7% for recent German oral history interviews with clean acoustic conditions. ... and achieve 23.9% WER on noisy and 15.6% word error rate on clean oral history interviews [using best optimized AI models]."

Those results will surely get better. The point is that whenever we deploy speech transcription, we must acknowledge that the performance we achieve might not be the cream-of-the-crop figure of under 2% WER but can creep up to a considerably larger number. The actual paper from M. Gref et al is here:




On the inherent difficulties when labeling patterns

Here is one of the posts I published that asks people to try labeling patterns themselves. Check it out and see how different folks on LinkedIn labeled the images.

The original source is the labeling interface that was created by Andrej Karpathy, of Tesla fame, back in 2014. He painstakingly learned the labeling approach so that he could help establish the human performance baseline on the ImageNet data set.

