Are bigger AI models better?
In AI, “the 2010s were the age of scaling, now we're back in the age of wonder and discovery once again.” This important insight comes from[1] Ilya Sutskever, a pioneer of deep-learning AI. So, when the co-developer of AlexNet (the first truly successful deep-learning AI model), co-founder of OpenAI, and most recently co-founder of the AI lab Safe Superintelligence (SSI) calls time on the recent trend of simply scaling up LLMs to make AI better and better, we all need to pay attention.
It was in the so-called Chinchilla[2] paper from 2022 that researchers showed that, for compute-optimal training of large language models, model size and the number of training tokens need to be scaled in equal proportion. For every doubling in the size of the model, the number of training tokens also needs to double, and when combined with an exponential increase in compute, this scaling approach has produced stunning results. But recently AI developers have started to see a slowdown from simply scaling current model methods. It has been reported that the improvements in the next version of Google's Gemini are falling short, Anthropic has delayed its next-generation Claude model, and over at OpenAI we hear[3] that their next-generation Orion model is seeing far smaller gains from model scaling than were previously seen between GPT-3 and GPT-4.
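To make the scaling rule concrete, here is a back-of-the-envelope sketch in Python. It uses two widely quoted approximations from the Chinchilla analysis – training compute C ≈ 6·N·D, and roughly 20 training tokens per parameter at the optimum – and the exact constants should be treated as rough assumptions rather than values lifted from the paper:

```python
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Split a training-compute budget into a compute-optimal (params, tokens) pair.

    Approximations used (rough rules of thumb, not exact paper values):
      * training compute  C ~= 6 * N * D   (N = parameters, D = training tokens)
      * at the optimum    D ~= 20 * N
    """
    n_params = math.sqrt(compute_flops / (6 * 20))
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Because N and D scale together, doubling the model size means doubling the
# training tokens as well -- and roughly quadrupling the compute budget.
for flops in (1e23, 4e23, 16e23):
    n, d = chinchilla_optimal(flops)
    print(f"C = {flops:.0e} FLOPs  ->  ~{n / 1e9:.0f}B params, ~{d / 1e12:.2f}T tokens")
```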
What we need to remember is that language is just an encoding scheme that humans use to share information and knowledge. For example, engineers and lawyers use special encodings (technical language) to describe complex terms, which improves the efficiency and accuracy of their information exchange. But now we have large language models that allow computers to ‘decode’ this human encoding scheme and to learn from our human information. This includes the roughly 500 trillion tokens (or word sequences) that are available on the indexed world-wide web. However, trying to embed all of this information into a single model is extremely expensive and appears to already be reaching its limits.
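To make the ‘encoding scheme’ idea concrete, here is a deliberately simple toy in Python that maps words to integer token IDs and back. Real LLM tokenizers work on sub-word pieces (byte-pair encoding) rather than whole words, and the sentence and vocabulary here are invented purely for illustration:

```python
# Toy illustration of "language as an encoding scheme": map words to integer
# token IDs and back again. Production tokenizers split text into sub-word
# pieces, but the principle is the same.
sentence = "engineers and lawyers use special encodings"

vocab = {word: idx for idx, word in enumerate(sorted(set(sentence.split())))}
inverse_vocab = {idx: word for word, idx in vocab.items()}

def encode(text: str) -> list[int]:
    """Turn text into a sequence of token IDs."""
    return [vocab[word] for word in text.split()]

def decode(token_ids: list[int]) -> str:
    """Recover the original text from its token IDs."""
    return " ".join(inverse_vocab[i] for i in token_ids)

token_ids = encode(sentence)
print(token_ids)          # [2, 0, 3, 5, 4, 1]
print(decode(token_ids))  # "engineers and lawyers use special encodings"
```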
But humans use other techniques too. We follow a ‘train of thought’ as we ‘reason’ over the information available to us, and we are starting to see this type of approach being used in AI as well. Noam Brown, a researcher at OpenAI, said[4] of their recent ‘o1’ release that "20 seconds of thinking time" at inference achieves an improvement that would otherwise have needed a "100,000x increase in model scale." Another stunning example is the breakthrough achieved by the Chinese research company DeepSeek-AI, who have also used a reasoning approach, leveraging reinforcement learning combined with a highly diverse mixture-of-experts model, to achieve results that go beyond what was previously possible with a single large model – and to do so in a far more efficient way. Whereas most AI researchers have simply been pushing hard on the scaling lever, this Chinese team, perhaps held back by recent export restrictions that limit access to GPUs, is showing that necessity can become the mother of invention.
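One simple member of this family of inference-time techniques is best-of-N sampling with majority voting (often called self-consistency). The sketch below is a toy: `sample_answer` is a hypothetical stand-in for sampling one reasoning path from a real model, and the point is only to show how spending more compute at inference time sharpens the final answer:

```python
import random
from collections import Counter
from typing import Callable

def sample_answer(question: str) -> str:
    """Stand-in for one sampled chain of thought from a language model.

    Here it is just a noisy oracle that answers "12" about 70% of the time,
    so we can see how extra inference-time samples improve reliability.
    """
    return "12" if random.random() < 0.7 else random.choice(["10", "11", "13"])

def self_consistency(question: str, model: Callable[[str], str], n_samples: int) -> str:
    """Spend more inference compute: sample N reasoning paths, return the majority answer."""
    answers = [model(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

question = "What is 3 * 4?"
print(self_consistency(question, sample_answer, n_samples=1))   # often wrong
print(self_consistency(question, sample_answer, n_samples=25))  # almost always "12"
```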
I find it very exciting (though not at all surprising) that we are about to see a new wave of innovation in artificial intelligence. When asked about AI development, I always point out that the AI you are using today is the worst AI you will ever use. We are just at the start of an incredible journey. Here are just a few simple ideas to consider:
- Today all AI models are single-task, whereas in all other software we take advantage of multitasking, using memory management units to protect data and manage data locality. Rather than following a single train of thought, most of us would consider multiple ideas, backtrack to pick up different threads, apply a different part of our expertise, and then combine these threads to find a solution. We will see these concepts emerge in next-generation AI models.
- Uncertainty Quantification (UQ) is another little-used concept that has been very successful in applying AI to complex real-world simulation problems. UQ directs the simulation towards the areas where uncertainty is highest and helps us understand how that uncertainty will change in response to certain events. This approach has been especially successful in the most complex time-based simulations, such as modelling how the plasma in a nuclear fusion tokamak reactor will behave over time and how this complex reaction can be controlled. Innovative young AI companies like digiLab, a world leader in UQ, are using the same approach to improve the efficiency, accuracy, and trustworthiness of new agentic AI systems (a generic sketch of this idea follows below).
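The sketch below illustrates the general principle behind uncertainty-driven sampling, not digiLab's own tooling: a small bootstrap ensemble of cheap surrogate models stands in for the expensive simulation, and each new simulator run is placed where the ensemble disagrees most, i.e. where predictive uncertainty is highest. All function names and constants here are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def expensive_simulation(x: np.ndarray) -> np.ndarray:
    """Stand-in for a costly simulator run (e.g. one configuration of a physics code)."""
    return np.sin(3 * x) + 0.05 * rng.normal(size=x.shape)

def fit_ensemble(x: np.ndarray, y: np.ndarray, n_models: int = 8, degree: int = 3) -> list:
    """Fit a small ensemble of polynomial surrogates on bootstrap resamples of the data."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))      # resample with replacement
        models.append(np.polyfit(x[idx], y[idx], degree))
    return models

def predictive_uncertainty(models: list, x_grid: np.ndarray) -> np.ndarray:
    """Ensemble disagreement (standard deviation of predictions) as a cheap UQ estimate."""
    preds = np.stack([np.polyval(m, x_grid) for m in models])
    return preds.std(axis=0)

# Start from a handful of simulator runs, then repeatedly query the point
# where the surrogate is least certain, instead of sampling blindly.
x_train = rng.uniform(-1.0, 1.0, size=8)
y_train = expensive_simulation(x_train)
x_grid = np.linspace(-1.0, 1.0, 201)

for step in range(10):
    models = fit_ensemble(x_train, y_train)
    x_next = x_grid[np.argmax(predictive_uncertainty(models, x_grid))]
    x_train = np.append(x_train, x_next)
    y_train = np.append(y_train, expensive_simulation(np.array([x_next]))[0])
    print(f"step {step}: next simulator run at x = {x_next:+.2f}")
```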
Yann LeCun believes[5] (as I do) that AI development should be open-sourced so that we can drive progress forward more rapidly. Open-sourcing and sharing ideas allows researchers to build quickly on the ideas of others, and also provides a much higher level of transparency that will help us make AI safer. As LeCun points out, humans “have the capacity to understand the world, understand the physical world, the ability to remember and retrieve things, persistent memory, the ability to reason, and the ability to plan.” As an example, when I say the word ‘apple’ you instantly know its rough size, its shape, its weight, that it grows on trees, the sound it makes when you bite into it, and the taste that will hit your tongue – oh, and it is also linked to Isaac Newton and the concept of gravity, and it might also refer to a computer or smartphone! You have so much real-world context about this simple word; it carries far more information and fires off different neurons and synapses in your brain when you are confronted by this simplest of human encodings. By contrast, an LLM just knows that it is five letters, one of which is repeated, and has some understanding of when it might be useful to add this word to a sentence as part of its very advanced text prediction.
I do not believe that LLMs on their own will ever deliver Artificial General Intelligence (AGI). AGI is a false path – instead we need to focus on building Artificial Expert Intelligence[6]. The good news is that “we're back in the age of wonder and discovery.” Welcome to the start of AI deep-learning 2.0, a world of new discoveries and the next generation of breakthrough AI companies.