How Close is Artificial General Intelligence in 2024?
Michael Spencer
A.I. Writer, researcher and curator - full-time Newsletter publication manager.
Hey Everyone,
This is a guest post by Conrad Gray who has a rather impressive Newsletter called Humanity Redefined.
“Our mission is to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity.” - OpenAI
“Our long-term vision is to build general intelligence, open source it responsibly, and make it widely available so everyone can benefit. We’re bringing our two major AI research efforts (FAIR and GenAI) closer together to support this,” Mark Zuckerberg, Meta
“We’ve come to this view that, in order to build the products that we want to build, we need to build for general intelligence,” Mark Zuckerberg (Meta)
To get the best of A.I. Supremacy, consider joining as a paid supporting member. Now over to Conrad.
2023 was a year of massive advancements in artificial intelligence. With the launch of ChatGPT, followed by the release of GPT-4 and other models and AI-powered products, many people were suddenly exposed to the bleeding edge of AI research and what these systems are capable of - from writing poems and solving math problems to coding and coaching. This exposure prompted some to ask: how close are we to the emergence of artificial general intelligence, AI systems that are as good as a human at almost anything humans can do?
But first, we need to understand what we mean by “artificial general intelligence”, or AGI in short. The problem here is that there is no agreed definition of what AGI is. In the Levels of AGI paper, Google DeepMind researchers list at least nine different definitions, ranging from AI systems capable of passing the Turing Test to “highly autonomous systems that outperform humans at most economically valuable work”, as defined by OpenAI.
The lack of an agreed-upon definition of AGI makes it difficult to declare if a system is AGI or not. The common consensus is that today, we don’t have an AGI system yet. However, there are some people who argue that modern large language models like GPT-4 are already AGI.
They argue that GPT-4 and other models have achieved generality and are therefore AGI. They point to what these models are capable of - from generating text almost indistinguishable from what a human would write, to writing code and solving math problems, to creating images of higher quality than most humans could produce.
They are not wrong in pointing out the capabilities of modern LLMs. But what is missing in these arguments is the performance. GPT-4 can give an answer to a complex question but there is a big chance the answer will be wrong. In other words, GPT-4 has the capability but does not necessarily have the performance required to call it a true AGI.
I expect to see further improvements in the performance of large language models. There are two ways of improving performance - either by increasing the size of the model or by using clever post-training techniques. The first, increasing the number of parameters in the network, is what brought us to where we were at the beginning of 2023. However, there are practical limits to how big a model can be. The bigger the model, the more computing power it requires and the longer it takes to train, which results in higher costs of developing and maintaining the system.
For those reasons, we can expect to see more research done to improve the performance of language models after the training. One such technique is fine-tuning, in which the model is trained to be specialised in certain domains, like legal or medical domains. A properly fine-tuned model can outperform larger, general language models in that specific domain.
Another technique to improve the reasoning capabilities of large language models is Chain of Thought prompting - instead of just asking the language model a question, it is more effective to ask the model to explain, step by step, how to solve the problem before giving the answer. It is a simple but very effective technique that dramatically improves the model's performance. This approach was later improved with Chain of Thought prompting with Self-Consistency, in which the model is asked to generate multiple answers to the same question. The most common answer then becomes the final answer given to the user.
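The self-consistency idea above fits in a few lines of code. This is a toy sketch, not a production implementation: `toy_model` is a stand-in that cycles through canned answers, where a real system would sample an LLM several times with a non-zero temperature.

```python
import itertools
from collections import Counter

def self_consistency(model, prompt, n_samples=5):
    """Sample several chain-of-thought answers and return the majority vote.

    `model` is any callable taking a prompt and returning a final answer;
    a real implementation would sample an LLM with temperature > 0 so the
    reasoning paths (and occasionally the answers) differ between samples.
    """
    answers = [model(prompt) for _ in range(n_samples)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer

# Toy stand-in model: answers "4" on three of every five samples.
_canned = itertools.cycle(["4", "5", "4", "3", "4"])
toy_model = lambda prompt: next(_canned)

print(self_consistency(toy_model, "What is 2 + 2? Think step by step."))  # → 4
```

The intuition behind the majority vote is that incorrect reasoning paths tend to disagree with each other, while correct ones tend to converge on the same answer.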
Researchers from DeepMind then took this concept one step further and instead of chains of thoughts they used trees. The Tree of Thoughts approach begins with several initial thoughts and explores where these thoughts lead the AI agent. ToT allows thoughts to branch off and explore other possibilities.
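Mechanically, Tree of Thoughts can be sketched as a beam search over partial "thoughts": expand each candidate, score the branches, keep only the most promising ones, and repeat. This is a minimal illustration rather than the paper's algorithm - in a real system `expand` and `score` would both be LLM calls, while here they solve a toy digit-building problem.

```python
def tree_of_thoughts(expand, score, root, beam_width=2, depth=3):
    """Breadth-first Tree-of-Thoughts sketch with beam-search pruning.

    `expand(thought)` returns candidate continuations of a partial thought;
    `score(thought)` rates how promising it is. Unpromising branches are
    dropped at every level, so the tree never blows up.
    """
    frontier = [root]
    for _ in range(depth):
        candidates = [t for thought in frontier for t in expand(thought)]
        if not candidates:
            break
        # Prune: keep only the top-scoring branches.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score)

# Toy problem: build the largest 3-digit string by appending digits.
expand = lambda s: [s + d for d in "123"]
score = lambda s: int(s) if s else 0
print(tree_of_thoughts(expand, score, ""))  # → 333
```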
Another way to improve the performance of language models is to use different architectures. A good example of how a well-executed architecture can lead to massive improvements in performance is the recently released Mixtral model from Mistral AI.
Mixtral is what is known as a Mixture of Experts model. Instead of one giant, monolithic model, Mixtral combines eight smaller Mistral 7B-scale expert networks working together. The result is that Mixtral is the best open-source model available right now, outperformed only by GPT-4 and, narrowly, by GPT-3.5, according to benchmarks published by HuggingFace. It’s also worth noting that the current understanding is that GPT-4 itself is a Mixture of Experts model, consisting of eight 220B models.
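The core mechanic of a Mixture of Experts layer can be shown in miniature: a router scores every expert, only the top-k experts actually run, and their outputs are blended by the renormalised routing weights. Everything below is an assumption for illustration - a scalar input, a toy linear router, and three hand-written experts - whereas real MoE layers like Mixtral's route each token through learned linear gates over neural experts.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, router_weights, top_k=2):
    """Minimal Mixture-of-Experts forward pass on a scalar input.

    Only the top-k experts are evaluated; the others cost nothing,
    which is why MoE models get big capacity at a modest compute price.
    """
    scores = [w * x for w in router_weights]  # toy linear router
    probs = softmax(scores)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)  # renormalise over the chosen experts
    return sum(probs[i] / norm * experts[i](x) for i in top)

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2]
out = moe_forward(3.0, experts, router_weights=[0.1, 0.5, 0.9], top_k=2)
```

With these weights the router prefers the last two experts, so the output is a blend of `2 * 3.0` and `3.0 ** 2` and the first expert is never run.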
How DeepMind and OpenAI approach the problem of deep reasoning
All the techniques I mentioned above - from Chain of Thought and Tree of Thought to new architectures - can improve the reasoning capabilities of language models and therefore improve their performance. They might be good enough for some tasks but I don’t think they will be enough to achieve AGI.
But they hint at the next breakthrough that can elevate language models to the next level - deep reasoning. Here, a model autonomously determines how to solve a problem given only a prompt. We can see hints of a possible solution in DeepMind's recently announced AlphaCode 2 and from the results the OpenAI team shared in the Let’s Verify Step by Step paper in mid-2023.
Let’s start with AlphaCode 2. AlphaCode 2 is an improved version of AlphaCode, the first AI model to reach a competitive level in programming competitions. Released almost exactly a year after its predecessor, AlphaCode 2 combines advanced language models with search and re-ranking mechanisms. When evaluated on the same platform as the original AlphaCode, AlphaCode 2 solved 1.7 times more problems, and performed better than 85% of human competitors.
How AlphaCode 2 compares to AlphaCode and human coders. Source: AlphaCode 2 Technical Report
To test how good AlphaCode 2 is, researchers gave the model some challenges from Codeforces, a platform for competitive programmers full of challenging coding problems.
AlphaCode 2 isn't just a single model; it's a suite of models based on Gemini, Google’s state-of-the-art AI model. Researchers at DeepMind took several Gemini Pro models and fine-tuned them to generate code. Each model was also tweaked to maximise the diversity of generated code.
These models generated up to a million code samples for each coding puzzle. Although not every sample is correct, the sheer volume increases the likelihood that at least some of them will be correct. The samples that do not compile or that do not produce the expected results are filtered out (according to DeepMind, this step removes 95% of the samples).
The remaining samples are clustered based on output similarity, leaving no more than ten for a scoring model to evaluate. The scoring model, also based on Gemini Pro, assigns scores to each solution, with the highest-scoring one selected as the final answer.
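The generate-filter-cluster-score pipeline described above can be sketched end to end. All the callables here (`compiles`, `outputs`, `score`) are hypothetical stand-ins for the real Gemini-based components, and short strings stand in for generated programs.

```python
def alphacode2_style_select(candidates, compiles, outputs, score, max_clusters=10):
    """Sketch of the AlphaCode 2 selection pipeline.

    1. Filter out samples that fail to compile or pass basic checks.
    2. Cluster the survivors by the output they produce.
    3. Let a scoring model rank one representative per cluster.
    """
    survivors = [c for c in candidates if compiles(c)]
    clusters = {}
    for c in survivors:
        clusters.setdefault(outputs(c), []).append(c)
    representatives = [group[0] for group in list(clusters.values())[:max_clusters]]
    return max(representatives, key=score)

candidates = ["a+b", "a  +  b", "return bad", "b+a"]
best = alphacode2_style_select(
    candidates,
    compiles=lambda c: "bad" not in c,   # toy stand-in for a real compiler check
    outputs=lambda c: "same-output",     # toy: every survivor behaves identically
    score=len,                           # toy stand-in for the Gemini scoring model
)
print(best)  # → a+b
```

Because all three surviving candidates produce the same output, they collapse into a single cluster and only its first representative reaches the scoring model - which is the point of the clustering step.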
The impression I get after reading the AlphaCode 2 paper is that at the moment, AlphaCode 2 feels like a brute-force approach powered by sophisticated language models. The system generates up to a million samples, hoping at least one is correct. The results are undeniably impressive but I think DeepMind can do better.
One way of improving AlphaCode 2 could be to use Gemini Ultra, the most capable model in the Gemini family. Gemini Pro, which DeepMind used for AlphaCode 2, is roughly the equivalent of GPT-3.5, which powers the free version of ChatGPT, while Gemini Ultra is the direct competitor to GPT-4. The researchers admit they would like to explore this in the future, though it would increase the computational cost of an already very computationally expensive system.
OpenAI, too, is exploring how to improve the performance of AI models by introducing advanced reasoning capabilities. In May 2023, researchers from OpenAI published a paper titled Let’s Verify Step by Step, in which they explored ways to improve multistep reasoning by evaluating the individual steps of the reasoning process rather than just the final answer. In their experiment, one model, named the generator (a fine-tuned GPT-4 without RLHF), generated steps to solve challenging math problems.
These steps were then verified using two different approaches - Outcome-supervised Reward Models (ORMs) and Process-supervised Reward Models (PRMs). ORMs verify only the final answer given by the model, while PRMs verify each step of the reasoning. Models using the PRM approach solved 78% of problems from the MATH dataset, a collection of 12,500 challenging competition mathematics problems - roughly twice what GPT-4 scored on the same test. Interestingly, the team at OpenAI was able to generalise this result to other fields such as chemistry and physics.
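The difference between the two reward models can be made concrete. In this sketch an ORM scores only the final step, while a PRM aggregates per-step scores by taking their product (one common choice; the paper works with per-step probabilities). Both `step_reward` and `answer_reward` are toy stand-ins for learned reward models.

```python
def orm_score(steps, answer_reward):
    """Outcome-supervised: reward depends only on the final answer."""
    return answer_reward(steps[-1])

def prm_score(steps, step_reward):
    """Process-supervised: every reasoning step is scored, and the
    per-step rewards are multiplied together, so a single bad step
    sinks the whole solution."""
    total = 1.0
    for step in steps:
        total *= step_reward(step)
    return total

# A solution whose final answer looks right but whose middle step is flawed.
steps = ["Rewrite 6*7 as 6*(5+2).", "6*5 = 35, 6*2 = 7 (flawed).", "So 6*7 = 42."]
step_reward = lambda s: 0.2 if "flawed" in s else 0.9
answer_reward = lambda s: 0.9 if "42" in s else 0.1

print(orm_score(steps, answer_reward))  # high: the final answer looks correct
print(prm_score(steps, step_reward))    # low: the flawed middle step is penalised
```

The ORM is fooled because the final answer happens to look right; the PRM catches the broken step, which is exactly why process supervision helped on MATH.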
Source: Let’s Verify Step by Step
Both AlphaCode 2 and Let’s Verify Step by Step also show that there is a lot of performance to gain from improving models after training. The future performance gains will most likely come not from making bigger and bigger models but from clever usage of smaller models.
GPT + AlphaZero = AGI?
Both AlphaCode 2 and the approach described in Let’s Verify Step by Step hint at how top AI labs aim to introduce deep reasoning in their AI systems. Both teams use the fact that it is easier to verify if the answer is correct than to generate a correct answer on the first try. In both cases, this verifier or scoring model is yet another language model.
Both teams also use fine-tuned language models (GPT-4 and Gemini Pro) to generate different answers for the verifier to evaluate. AlphaCode 2 generates up to one million different code samples, while the model in Let’s Verify Step by Step generates up to 1000 solutions per problem. In both cases, we have one set of language models generating a huge number of possible solutions for the verifier to check and pick the correct solution. The results, as we have seen, are impressive and promising.
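Stripped of the scale, the shared recipe - one model proposes many candidates, another picks the best - is best-of-N sampling with a learned verifier. In this sketch both roles are toy callables: `toy_generate` cycles through canned answers and `toy_verify` simply prefers answers close to 42, where a real system would use a trained reward or scoring model.

```python
import itertools

def best_of_n(generate, verify, prompt, n=4):
    """Generate-then-verify: sample n candidate solutions and return the
    one the verifier scores highest. This exploits the asymmetry noted
    above - verifying an answer is easier than producing it correctly
    on the first try."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verify)

# Toy stand-ins for the generator and verifier models.
_samples = itertools.cycle([40, 43, 42, 10])
toy_generate = lambda prompt: next(_samples)
toy_verify = lambda answer: -abs(answer - 42)

print(best_of_n(toy_generate, toy_verify, "the ultimate question", n=4))  # → 42
```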
So, where do we go next from here? Well, the authors of Let’s Verify Step by Step told us what the next natural step is - fine-tuning the answer-generating model with reinforcement learning. This reminds me of what Demis Hassabis shared in an interview with The Verge: “Planning and deep reinforcement learning and problem-solving and reasoning, those kinds of capabilities are going to come back in the next wave after this [generative AI]”.
I think researchers at DeepMind and OpenAI might incorporate something similar to Tree of Thoughts into their reasoning models. That would enable the generator models to come up with a number of different ideas as a starting point. Each of these initial ideas could form a distinct path, or even a tree, of reasoning. The verifier would check each path at every step; incorrect paths of reasoning would be abandoned. Eventually, the model would find an answer to the question.
However, this approach results in exploring enormous trees of all possible reasoning paths, with the vast majority of them being dead ends. But DeepMind has already encountered a similar problem. The number of all possible boards in Go is larger by orders of magnitude than the number of atoms in the universe. Checking every possible path was not possible.
And yet DeepMind built AlphaGo, which mastered Go beyond the human level. I would not be surprised if there is an experimental model at DeepMind exploring this possibility. There is also speculation that Q*, the rumoured follow-up to Let’s Verify Step by Step that sparked the chain of events leading to the ousting of Sam Altman from OpenAI, might have had some reinforcement learning elements.