AI news in 2024 was filled with “breakthroughs.” Anthropic launched Claude 3.5 Sonnet, OpenAI released GPT-4o, and Google unveiled Gemini 2.0. With each release, commentators declared: “AI is one step closer to AGI.”

Then Elon Musk’s xAI launched Grok-2, claiming it surpassed GPT-4 on certain tests. Public opinion boiled over again: Has AGI arrived? Do we have less than six years left?

I want to pause at this moment and ask a simple question: what are we testing?

The Illusion of “Surpassing”

Grok-2 does perform better than GPT-4 on certain standardized tests. But what are these tests?

They are primarily “benchmarks”: MMLU (multiple-choice questions on mathematics, science, history, law, and other subjects), HumanEval (code generation), GSM8K (grade-school math word problems), and others. These tests are well designed, but each measures only a specific, quantifiable kind of ability.

Imagine measuring human intelligence by the ability to win chess games. By that standard, Deep Blue “surpassed human intelligence” in 1997, when it beat Garry Kasparov. But no one would say that.

The reason is that chess, though complex, is a closed system: the rules are fixed, the objectives are clear, and the feedback is immediate. The real world isn’t like this.

AI’s progress on benchmarks is similar to Deep Blue’s progress in chess. Both are optimization within highly structured, clearly defined problems.

Three Dimensions of Human Intelligence

If we’re going to talk about “surpassing human intelligence,” we must first define human intelligence.

Psychologists generally recognize that human intelligence has multiple dimensions:

  1. Cognitive abilities: Problem-solving, pattern recognition, logical reasoning
  2. Adaptability: Learning and adjusting strategies in new environments
  3. Value judgment: Understanding what’s important, what’s not, making trade-off decisions

Current AI is progressing fastest in the first dimension. Grok-2, GPT-4, and Claude all perform excellently on cognitive tests.

But in the second and third dimensions, AI is still far from adequate.

The Problem of Adaptability

Can Grok-2 answer questions about the 2022 World Cup? Perhaps, if its training data includes that information.

But suppose that tomorrow the World Cup rules changed, the event moved from summer to winter, and the venue shifted from Earth to the Moon. What would Grok-2 do?

Could it discover this change on its own and quickly adjust its understanding? No.

It could only wait for new training data. But humans, watching one soccer game on the Moon, could immediately understand the new rules and begin thinking about new strategies.

The Problem of Value Judgment

The deeper issue is values.

Grok-2 might be able to write a complete paper on climate change. But it doesn’t know which choice, among all options, is most important for humanity. It doesn’t know whether to prioritize economic growth or environmental protection. It doesn’t know whose voice to listen to.

The core of human intelligence is the ability to make trade-offs between different values. And this trade-off capability is completely absent in current AI.

The Mathematics of the Six-Year Timeline

OK, but if AI capabilities improve at some rate each year, couldn’t they reach AGI in six years?

This reasoning rests on a classic misconception: that growth stays exponential indefinitely.

Many people believe AI development is exponential. Moore’s Law, computational power growth, dataset expansion—all are driving AI acceleration.

But this has a major problem: we don’t have infinite computational power, nor infinite data.

The largest AI models are already running into computational resource bottlenecks. Training GPT-4 reportedly cost more than $100 million, and the next generation of models may cost far more.

Meanwhile, text data that can be mined from the internet is finite. We’re already approaching “data famine.”

A recent Google study shows that large language model performance improvement rates are already slowing. From 2020 to 2024, improvement rates slowed from about 40% per year to around 10% per year.

If this trend continues, and without new technological breakthroughs, AI won’t reach AGI in six years. It might take longer—possibly a decade, possibly twenty years, or possibly never.
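
To make the arithmetic concrete, here is a minimal sketch of how much difference those two rates make over six years. It assumes, purely for illustration, that “improvement” is a single index that compounds annually. The 40% and 10% figures are the ones quoted above, and the function name compound is hypothetical.

```python
def compound(rate: float, years: int) -> float:
    """Cumulative improvement factor after compounding `rate` for `years` years."""
    return (1.0 + rate) ** years

YEARS = 6  # the six-year timeline discussed in the text

fast = compound(0.40, YEARS)  # assumed: the earlier 40% yearly rate had held
slow = compound(0.10, YEARS)  # assumed: the current 10% yearly rate holds

print(f"40% per year for {YEARS} years: {fast:.2f}x")
print(f"10% per year for {YEARS} years: {slow:.2f}x")
```

Under these assumptions, six years at 40% per year multiplies the index by about 7.5, while six years at 10% per year does not even double it (about 1.8).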

The Real Significance of Grok-2

Grok-2’s emergence means AI competition has intensified. xAI, Musk’s AI company, which is closely tied to X (formerly Twitter), has the resources and motivation to compete with OpenAI. This might accelerate certain aspects of AI progress.

But Grok-2 surpassing GPT-4 on some benchmarks doesn’t mean it’s closer to AGI.

It only means that, on those specific tests, Grok-2 was better optimized.

AGI, if it ever exists, would be a qualitative transformation: a system capable of improving itself, setting its own goals, and truly understanding human values. That is a different order of achievement from scoring 2% higher on MMLU.

Conclusion: Waiting for the Next Breakthrough

What I’m saying is this: based on current trends, a six-year timeline is overly optimistic. AGI might come, but not on a schedule derived from calculations like this.

AI will continue to progress. But that progress might follow an S-curve, not an exponential curve. We may have already passed through the rapid growth phase and may now be entering the plateau stage.
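
The difference between the two curves can be sketched as follows. This is an illustration, not a model fitted to any data: the growth rate, the ceiling, and the function names are all assumptions, chosen so that both curves start at 1 and look alike early on.

```python
import math

RATE = 0.5      # assumed growth rate per time step (illustrative only)
CEILING = 10.0  # assumed upper limit for the S-curve (illustrative only)

def exponential(t: float) -> float:
    """Unbounded exponential growth starting at 1."""
    return math.exp(RATE * t)

def s_curve(t: float) -> float:
    """Logistic growth starting at 1 and levelling off at CEILING."""
    return CEILING / (1.0 + (CEILING - 1.0) * math.exp(-RATE * t))

# Print both curves side by side: they track each other early, then diverge.
for t in range(0, 13, 2):
    print(f"t={t:2d}  exponential={exponential(t):8.2f}  s_curve={s_curve(t):5.2f}")
```

Early on the two curves are hard to tell apart, which is why a few years of fast progress cannot by itself tell us which curve we are on.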

And true AGI might require a new technological breakthrough—perhaps new algorithms, new hardware, or a new understanding of intelligence itself.

Until then, just enjoy AI’s progress on specific tasks. Don’t let a test score substitute for your judgment about AGI.