2024’s AI news has been filled with “breakthroughs.” Anthropic released Claude 3.5, OpenAI launched GPT-4o, and Google announced Gemini 2.0. With each release, commentators say: “AI is one step closer to AGI.”
Then Elon Musk’s xAI released Grok-2, claiming it outperforms GPT-4 on certain tests. The internet erupted again: Has AGI already arrived? Is it now less than six years away?
I want to pause here and ask a simple question: What are we actually testing?
The Illusion of “Surpassing”
Grok-2 does indeed perform better than GPT-4 on certain standardized tests. But what are these tests?
They’re primarily “benchmark tests”: MMLU (covering math, science, history, law, and other knowledge), HumanEval (code generation), GSM8K (math reasoning), etc. These are well-designed tests, but they only measure one specific, quantifiable type of ability.
Imagine if we measured human intelligence by whether you can win a chess game. Then Deep Blue would have “surpassed human intelligence” when it beat Garry Kasparov in 1997. But no one says that.
Why? Because chess, while complex, is a closed system. The rules are fixed, the objective is clear, the feedback is immediate. The real world isn’t like that.
AI’s progress on benchmarks is analogous to Deep Blue’s progress in chess. Both are optimizations on highly structured, well-defined problems.
Three Dimensions of Human Intelligence
If we’re going to talk about “surpassing human intelligence,” we first need to define human intelligence.
Psychologists widely recognize that human intelligence has multiple dimensions:
- Cognitive Ability: Problem-solving, pattern recognition, logical reasoning
- Adaptability: Learning and adjusting strategies in new environments
- Value Judgment: Understanding what matters and what doesn’t, making trade-off decisions
Current AI is progressing fastest on the first dimension. Grok-2, GPT-4, and Claude all excel on cognitive tests.
But on the second and third dimensions, we’re still far behind.
The Adaptability Problem
Can Grok-2 answer questions about the 2024 World Cup? Maybe, if its training data includes it.
But what if tomorrow the World Cup rules changed? What if the tournament moved from summer to winter, from Earth to the Moon?
Can Grok-2 discover this change itself and rapidly adjust its understanding? No.
It can only wait for new training data. Humans, seeing a soccer game on the Moon, would instantly understand the new rules and start thinking about new strategies.
The Value Judgment Problem
The deeper issue is values.
Grok-2 might be able to write a complete essay on climate change. But it doesn’t know which choice, among all possibilities, matters most to humanity. It doesn’t know whether to prioritize economic growth or environmental protection. It doesn’t know whose voice to listen to.
The core of human intelligence is the ability to navigate trade-offs between different values. And current AI completely lacks this capacity for value judgment.
The Mathematics of the Six-Year Timeline
OK, but if AI capability improves at some rate every year, wouldn’t it reach AGI in six years?
This touches on a classic misconception: exponential growth.
Many believe AI development is exponential: Moore’s Law, growing computing power, and expanding data all seem to drive continuous acceleration.
But there’s a huge problem: we don’t have unlimited computing power or unlimited data.
Currently, the largest AI models have hit computational resource bottlenecks. Training GPT-4 reportedly cost on the order of a hundred million dollars, and the next generation might require even more.
Meanwhile, publicly available text data on the internet is finite. We’re approaching “data starvation.”
Google’s recent research shows that large language model performance improvements are already slowing. From 2020 to 2024, the improvement rate slowed from about 40% per year to about 10% per year.
If this trend continues without new technological breakthroughs, AI won’t reach AGI in six years. It might take decades—or possibly never.
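As a back-of-the-envelope illustration, we can compound the slowdown described above. The 40%-to-10% figures are taken from the article’s claim and used here purely as assumed inputs, not measured data; the point is only that a declining annual rate gives a very different six-year outcome than naive exponential extrapolation.

```python
def project(years, rate_fn, start=1.0):
    """Compound a capability index year by year, where rate_fn(year)
    returns that year's fractional improvement rate."""
    level = start
    for year in range(years):
        level *= 1 + rate_fn(year)
    return level

# Naive extrapolation: 40% improvement every year.
constant = project(6, lambda y: 0.40)
# Slowing trend: annual rate falls linearly from 40% down to 10%.
slowing = project(6, lambda y: 0.40 - 0.06 * y)

print(f"constant 40%/yr for 6 years: {constant:.2f}x")  # ~7.53x
print(f"slowing 40%->10%/yr:         {slowing:.2f}x")   # ~3.74x
```

Under these toy assumptions, the slowing scenario delivers about half the capability gain of the constant-rate one, and the gap widens every additional year you extrapolate.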
What Grok-2 Really Means
Grok-2’s emergence means AI competition is intensifying. Musk’s xAI, backed by X (formerly Twitter), has the resources and motivation to compete with OpenAI. This might accelerate progress on some AI fronts.
But Grok-2 surpassing GPT-4 on some benchmarks doesn’t mean it’s closer to AGI.
It just means Grok-2 is better optimized for that particular test.
And AGI, if it truly exists, would be a qualitative transformation. Not getting 2% more points on MMLU, but being able to self-improve, set its own goals, and understand human values.
Conclusion: Waiting for the Next Breakthrough
I’m not saying AGI will never come. I’m just saying that based on current trends, the six-year timeline is too optimistic.
AI will continue to improve. But improvement might follow an S-curve, not an exponential curve: we may have passed the rapid-growth phase and be entering the plateau.
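The S-curve intuition can be sketched with a logistic function (all parameters here are arbitrary, for illustration only, not fitted to any real measurements): early on it is nearly indistinguishable from exponential growth, but year-over-year gains shrink once the curve passes its midpoint.

```python
import math

def logistic(t, ceiling=100.0, rate=1.0, midpoint=5.0):
    """Logistic S-curve: near-exponential growth before the midpoint,
    flattening toward `ceiling` after it."""
    return ceiling / (1 + math.exp(-rate * (t - midpoint)))

# Year-over-year gains rise toward the midpoint, then fall off.
gains = [logistic(t + 1) - logistic(t) for t in range(10)]
```

An observer standing on the left half of this curve sees accelerating progress and naturally extrapolates an exponential; only later does the ceiling become visible.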
True AGI might require a new technological breakthrough—perhaps a new algorithm, new hardware, or a new understanding of intelligence itself.
Until then, we should enjoy AI’s progress on specific tasks while being careful not to be fooled by benchmark numbers.
Surpassing human intelligence is not a numbers game.