Will the AI "Mythical Man-Month" Bubble Burst? Is the Scaling Law Failing?
Even for open-source models, the performance curve is flattening


The only certainty is that nothing is certain. -- Pliny the Elder, Roman naturalist

In 2019, I found myself torn between an offer from Google Brain and offers from several autonomous driving companies. Google's condition was that I join the Google Translate team. On one hand, there was the seemingly "over-mature" Google Translate; on the other, the hot and promising field of autonomous driving. Believing that autonomous driving represented the cutting edge of AI, I chose the latter. Fate, however, had a different plan. The Google Translate team unexpectedly thrived in the era of large language models, while autonomous driving companies faced significant setbacks and struggled with real-world deployment. This experience taught me that technological development is fraught with uncertainty, often marked by cyclical rises and falls and unforeseen twists.

This uncertainty is particularly evident in today's AI landscape.

Same Chart, Different Perspective

Recently, a trend chart comparing the capabilities of closed-source and open-source models has sparked heated discussions in the AI community. People are amazed at the rise of open-source models. Interestingly, not long ago, both OpenAI CEO Sam Altman and former Chief Scientist Ilya Sutskever claimed that closed-source models would always outpace open-source ones.

The rise of open-source models is undoubtedly positive. However, I see this not as a sudden acceleration of open-source models but rather as a sign of stagnation in the capabilities of closed-source models. Does this indicate that the Scaling Law is failing?

It's too early to draw conclusions. We can wait to see how GPT-5 performs and whether the next generation of open-source models can surpass their closed-source counterparts.

To maintain the bubble, GPT-5 needs to achieve 99% on MMLU

I've annotated this trend chart to highlight my point. If the AI scaling law holds true, then GPT-5 (or whichever model comes next) would need to reach this specific position, essentially achieving over 99% accuracy on MMLU.
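As a back-of-the-envelope illustration of that kind of extrapolation, here is a minimal Python sketch that fits a straight line to the log of the error rate over time and reads off where the next model would have to land if the trend continued. The (year, score) pairs are hypothetical placeholders rather than values from the chart, so the printed number is only illustrative of the mechanics.

```python
import math

# Hypothetical (release year, MMLU score) pairs for successive frontier models;
# replace these placeholders with the values read off the trend chart above.
points = [(2020, 54.0), (2022, 70.0), (2023, 86.0)]

# Least-squares fit of a straight line to log(error rate) vs. year.
xs = [year for year, _ in points]
ys = [math.log(100.0 - score) for _, score in points]
n = len(points)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
intercept = y_mean - slope * x_mean

# If the log-linear trend held, where would a 2025-era successor have to land?
projected_error = math.exp(slope * 2025 + intercept)
print(f"projected MMLU score: {100.0 - projected_error:.1f}")
```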

Other Observations

As scores get higher, further improvements become increasingly difficult. Therefore, we should look at the reduction in error rate rather than just the increase in accuracy. For example, improving from 90 to 95 cuts the error rate in half (from 10 points down to 5), which is just as significant as going from 50 to 75.
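To make that comparison concrete, here is a minimal Python sketch of the error-reduction calculation, using the two score pairs from the example above.

```python
def error_reduction(old_score: float, new_score: float) -> float:
    """Relative reduction in error rate when a benchmark score improves.

    Scores are accuracies on a 0-100 scale, so the error rate is 100 - score.
    """
    old_error = 100.0 - old_score
    new_error = 100.0 - new_score
    return (old_error - new_error) / old_error

# Both improvements cut the error rate in half, i.e. a 50% reduction.
print(error_reduction(90, 95))  # 0.5
print(error_reduction(50, 75))  # 0.5
```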

Below, I've conducted a simple analysis of the evaluation results for various generations of the LLaMA model to examine whether the efficiency of scaling is diminishing. The data sources are Meta's published papers on each generation of LLaMA:

  • LLaMA 1: LLaMA: Open and Efficient Foundation Language Models
  • LLaMA 2: LLaMA 2: Open Foundation and Fine-Tuned Chat Models
  • LLaMA 3: The LLaMA 3 Herd of Models | Research - AI at Meta
  • LLaMA 3.1: Introducing LLaMA 3.1: Our most capable models to date

Here, we use the language-understanding benchmark (MMLU, 5-shot) as an example; evaluations in other areas show similar patterns. In the chart below, the x-axis represents model size and the y-axis represents the MMLU score, with different colors representing different generations of LLaMA models.

We can see that the efficiency of scaling is diminishing with each generation, as the curve flattens out. One positive sign, however, is that the improvement from LLaMA 2 to LLaMA 3 is greater than that from LLaMA 1 to LLaMA 2. This makes me hopeful for LLaMA 4, which is already in training. From an execution perspective, Meta/Facebook has never disappointed me.

We can quantify this trend using scaling efficiency, defined as the reduction in error rate divided by the increase in model size. By this measure, scaling efficiency has been decreasing, from a high of 0.21 down to 0.04.

Efficiency ratio indicates how effectively increasing the model size boosts performance
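For readers who want to run this kind of analysis themselves, below is a small Python sketch that takes a literal reading of the definition above: the drop in error rate (in percentage points) divided by the increase in parameter count (in billions). The scores in the example are hypothetical placeholders rather than the exact figures from the LLaMA papers, and the table above may normalize the ratio differently, so treat the output as illustrative.

```python
# Illustrative sketch: how efficiently do extra parameters buy MMLU accuracy?
# Replace the placeholder scores with the 5-shot MMLU numbers reported in the
# LLaMA papers listed above.

def scaling_efficiency(size_small_b: float, score_small: float,
                       size_large_b: float, score_large: float) -> float:
    """Reduction in error rate (percentage points) per additional billion parameters."""
    error_drop = (100.0 - score_small) - (100.0 - score_large)
    size_increase = size_large_b - size_small_b
    return error_drop / size_increase

# Hypothetical example: a 7B model scoring 45.0 vs. a 70B model scoring 69.0.
print(round(scaling_efficiency(7, 45.0, 70, 69.0), 3))  # ~0.381
```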

Exam Performance vs. Genius-Level Mastery

In the AI field, achieving a perfect score on a specific benchmark does not mean a model has achieved perfection in that area. Much like in academics, a genius student can score 100 because that's the maximum possible, whereas an excellent student might score 99 due to limitations in their ability. That one-point difference, however, is vast.

Even if we only look at the accuracy of test sets, we haven't seen a model that excels across all metrics. For instance, early models achieved near-perfect performance on the MNIST dataset, and some models have reached over 92% accuracy on the ImageNet dataset.

The MNIST dataset: handwritten digit samples covering 10 classes (the digits 0-9)
The ImageNet dataset: millions of object images spanning 1,000+ classes

However, in current AI research, we haven't seen a model with "genius-level" attributes. All progress still relies on vast amounts of data, computational power, and meticulous tuning. This is similar to how excellent students must put in continuous practice and effort to get close to perfect scores, rather than achieving it effortlessly.

The Mythical Man-Month and Scaling Law

In software engineering, there's a famous concept known as the "Mythical Man-Month," which suggests that simply adding more human resources cannot linearly speed up project progress. Similarly, in AI, we face a comparable issue with the limitations of the Scaling Law.

The Scaling Law emphasizes enhancing model performance by increasing computational power and data volume. However, this approach is akin to the "manpower stacking" in the Mythical Man-Month, which may be a necessary condition but not a sufficient one. Merely relying on computational power and data doesn't guarantee breakthrough advancements.

For example, 10,000 hours of practice can make a genius shine, but an average person putting in the same time and effort might surpass 95% of people without ever reaching genius level. Similarly, in AI research, while increasing computational power and data is crucial, have we overlooked the necessity of innovation and algorithmic optimization?

In understanding the Scaling Law, have we mistaken this necessary condition for a sufficient one?

Trends and Cycles in Technological Maturity

Technological progress and the formation of bubbles often go hand in hand. Bubbles aren't entirely bad; they reflect high expectations for emerging technologies. It's precisely these bubbles that bring substantial resource influx into a field, accelerating technological development.

As the Daoist saying goes, "Prosperity leads to decline, and extreme adversity leads to prosperity." This phrase vividly describes the cyclical nature of technological development.

Many of you might have seen Gartner's annual Hype Cycle, its technology maturity curve. It shows that emerging technologies typically go through a bubble-inflation phase, followed by a bubble burst, and eventually enter a phase of steady maturity.

As early as 2017, Gartner placed deep learning and machine learning at the peak of the bubble. At that time, image understanding models were proliferating, with R-CNN, YOLO, SSD, and RetinaNet showing impressive visual understanding results, and China's four major AI companies (Megvii, SenseTime, YITU, and CloudWalk) were extremely popular. However, beyond security surveillance and autonomous driving, these technologies had not found large-scale commercial applications, and deep learning had not yet delivered significant economic benefits (measured in billions) in practice.

As Nassim Taleb noted in "The Black Swan," black swan events are unpredictable and have profound impacts. This applies to both individuals and collectives. No one anticipated that Google's 2017 paper "Attention Is All You Need," which proposed the Transformer architecture, would create such a sensation. BERT followed in late 2018. These two papers sparked a revolution in natural language processing that quickly spread to other fields.

Neither I nor Gartner could have foreseen this transformation at the time.

By 2020, generative AI had appeared on Gartner's curve. ChatGPT hadn't been released yet, and GPT-2 had only garnered some attention within the community in 2019. I remember casually looking at this model, which could generate short stories, and finding it merely interesting.

However, by 2024, generative AI has become the hottest topic. From Gartner's perspective, generative AI seems to be at a point similar to deep learning in 2017, possibly nearing the burst of its bubble.
