Why are LLMs just not good at some things?
I hear again and again, “AI is useless, it can’t even…”. So I thought I would point out not only what AI is not good at and why, but also what it’s actually good at. With this knowledge you can firstly avoid the former scenarios and secondly better understand what AI can do, and do well.
AI vs Gen AI
Firstly, let me just point out that by “AI” I’m referring to Generative AI, or “Gen AI”. Gen AI is a subset of AI; ChatGPT and the like are basically Gen AI, and the technology they use is called a Large Language Model, aka LLM. The G in GPT is “Generative” and the PT is “Pre-trained Transformer”. The transformers are the clever bit, and they are implemented in LLMs using attention heads.
I could go on, and on, but I wanted to be clear that “AI”, when spoken about in terms of ChatGPT, is actually Gen AI, and for our purposes Gen AI and LLMs are effectively the same thing.
What is AI not good at?
I have four examples that are often used to demonstrate how “stupid” AI is, and I’d like to explain each of them and why it fails.
It’s all about tokens
The first part of my explanation should help with all of the examples. The way an LLM works is to first turn every word into a token; a token in this sense is a number. The LLM has no concept of words. It uses the attention heads to figure out the meaning of the tokens, so the token for “beer” ends up with a similar meaning to lager, pint, ale and many other words, including the foreign equivalents. The LLM will just see a token (a number) though; it will only ever see a token.
The process of turning words into tokens is, unsurprisingly, called tokenisation (naturally there’s a z in there for Americans).
The words “I love NY” become 306, 5360, 23526 in the Llama LLM.
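If you want to see this for yourself, here’s a minimal sketch using OpenAI’s tiktoken library (an assumption on my part, installed with pip install tiktoken); the IDs it prints will differ from the Llama values above because every model family has its own tokeniser, but the principle is identical.

import tiktoken

# Load the tokeniser used by GPT-4-era models; other models use different ones.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("I love NY")
print(tokens)                              # just a list of integers
print([enc.decode([t]) for t in tokens])   # the text each token maps back to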
LLMs can’t spell
ChatGPT 4 sees the word “ Strawberry” as the number 89077. The only way it would “know” there are any Rs at all in the word is if it was trained to count letters or spell. Just as it might “know” that Paris is the capital of France, it gets that fact from training, and unless each and every token’s spelling is trained it’s going to have no idea whether there is a Z in the word Strawberry or three Rs.
Getting an LLM to spell is possible but not a great use of training.
LLMs can’t reverse sentences
Not the most useful skill to have, but relatively easy even for a 5 or 6 year-old. If we take the example “The dog ate my homework.”, remembering from the previous explanation that LLMs only see tokens and not words, the LLM sees 450, 11203, 263, 371, 590, 3271, 29889 from “The”, “ dog”, “ a”, “te”, “ my”, “ home” and “work.”. It’s worth noting that the words “ate” and “homework.” are not in the dictionary of the LLM (Llama), so the tokeniser breaks the words into parts, which further complicates the task. The way the transformers work is to select the most likely next token, token by token. The LLM doesn’t see or understand the instruction to reverse the sentence, or even what the sentence is. It sees the last token 1287 and uses the others in the input to select the next token. “homework.” would be the correct response, but it doesn’t make much sense on its own, so it’s unlikely to be selected.
Again, an LLM could be trained to reverse sentences, but it’s not a great use of a lot of expensive hardware.
LLMs can’t write text with a specific number of words
Hopefully, by now, you’re getting a feel for the way an LLM works. Once it’s output a token, that’s it; it moves on to the next but doesn’t know what’s coming until it gets there. For me to write the sentence “This sentence has 5 words.”, I need to first think it or write it and then go back and change or fill in the number. An LLM simply can’t do this, for two reasons: firstly, it writes sequentially; secondly, it has no state, so it can’t maintain a count. To make matters more complex, some of the words are split into multiple tokens, and even the full stop is a token.
This is something an LLM is unlikely to be able to do, other than for specific, captured, pre-trained examples.
LLMs can’t do maths
(To my American friends, there should definitely be an ’s’ in maths!)
This is probably the most surprising and critical misunderstanding of LLMs. As we’ve seen above, LLMs see tokens and try to predict the reply, token by token. If I ask an LLM a relatively simple maths question, it’s seeing tokens for the numbers, and just seeing tokens. For example, 27*37 is 29871, 29906, 29955, 29930, 29941, 29955, 29922 in tokens; you can see the two 7s as “29955”.
Unless the specific question is in the training set, the LLM is only going to be able to guess the answer, and that’s usually wrong. If I asked you 27*37 and gave you just a second to answer, you’d be doing the same, just guessing. For those with a memory for useless facts like myself, you might just know the answer is 999.
The way of dealing with maths in LLMs is changing; they are now being trained to “recognise” simple mathematical instructions and deal with them more systematically. The results from some of the maths-specific LLMs are quite astounding, but you should always be careful when using numbers with LLMs: it’s not what they’re designed for, though they are improving significantly.
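As a rough sketch of what “dealing with them more systematically” can look like (my own illustration, not any vendor’s actual implementation), a wrapper can spot a pure arithmetic expression and evaluate it deterministically rather than letting the model predict the digits:

import ast
import operator
import re

# Map AST operator nodes to real arithmetic; anything else is rejected.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr):
    """Evaluate a +, -, *, / expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def answer(question):
    # If the question is nothing but digits and arithmetic symbols, compute it;
    # otherwise hand it to the LLM as normal text.
    if re.fullmatch(r"[\d\s\+\-\*\/\(\)\.]+", question.strip()):
        return str(safe_eval(question))
    return "(pass the question to the LLM as usual)"

print(answer("27*37"))

> 999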
What are LLMs good at?
We’re still learning how to use LLMs (ChatGPT, Claude, Gemini etc.). They are extremely good at parsing and understanding language, not just English but almost any language, whether written, spoken or even visual. Most of the work in the internals of the LLM, the transformers, is “understanding”, or applying meaning to, tokens (usually words). A small proportion of what they do, but the critical part, is generating the next token. An LLM may parse 20 pages of text just to reply with a single sentence.
Knowing this is important: use LLMs for parsing, understanding and interpreting. It’s like the human or animal learning process. A dog understands the word “walk” or “food” but can’t say it, a child understands the word “drink” or “biscuit” before it can say it, and an adult learning a foreign language can watch a film in that language long before they can speak it.
LLMs are just tools. Like new employees in a company, they perform tasks, but if you put them in a group or get them to work with others they can be far more effective. The latest trend with LLMs is what’s called “agentic” AI, where LLMs make “decisions”, write code and function at a higher level than their basic selves.
Using the ability to write code, even the most basic LLM running on a mobile phone can solve all of the problems above by first writing the code.
LLMs can provide a different approach to solving problems
There is one simple way to solve all of the above, and we’re seeing this increasingly with many of the proprietary LLMs like ChatGPT, Claude and others. In every case, we first get the LLM to write code, or to break down the question, and then either run the code or feed the breakdown back into the LLM (or another instance).
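Here’s a hedged sketch of that pattern; ask_llm() is a placeholder I’ve invented for illustration, standing in for whatever chat API or on-device model you happen to use, and in a real system the generated code should be sandboxed rather than run blindly.

import subprocess
import sys
import tempfile

def ask_llm(prompt):
    """Placeholder: return Python source code produced by your model of choice."""
    raise NotImplementedError("wire this up to your LLM API or local model")

def solve_with_code(question):
    # Ask the model to write a small script, save it, run it, and return what it prints.
    source = ask_llm(f"Write a short Python script that prints the answer to: {question}")
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=30)
    return result.stdout.strip()

# Usage (once ask_llm is implemented):
# print(solve_with_code("How many letter r's are there in the word strawberry?"))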
All the following code snippets were written by a very basic LLM (Qwen2.5 3B) running on my phone ...
LLMs can’t spell, but…
word = "strawberry"
count_r = word.count('r')
print(f"The number of 'r's in 'strawberry' is: {count_r}”)
> The number of 'r's in 'strawberry' is: 3
LLMs can’t write text with a specific number of words, but…
sentence = "There are n words in this sentence"
word_count = len(sentence.split())  # Count the words in the sentence
sentence_with_n = sentence.replace(' n ', f' {word_count} ')  # Replace only the standalone 'n', not every letter n
print(sentence_with_n)
> There are 7 words in this sentence
LLMs can’t reverse sentences, but…
sentence = "The dog ate my homework."
reversed_sentence = ' '.join(sentence.split()[::-1])
print(reversed_sentence)
> homework. my ate dog The
LLMs can’t do maths, but…
input_string = "27*37"
numbers = input_string.split('*')
result = int(numbers[0]) * int(numbers[1])
print(f"{input_string} = {result}")
> 27*37 = 999
Conclusion
Like a calculator, LLMs are very good at specific things. Sticking with the calculator analogy, they can do what they’re designed to do incredibly well, but unless we know what they can and can’t do, we’re liable to use them in the wrong way.
I hope this article has given you some understanding of why LLMs are not good at some things and perfect for others.