Road to AGI

Road to AGI

Just watched the brilliant talk “How We Get To AGI” by François Chollet. Highly recommend it if you’re curious about where AI is actually going.

Back in 2019, Chollet introduced the ARC benchmark to test fluid intelligence -the ability to solve new problems on the fly, not just repeat what you’ve seen before. I keep watching this space and thinking about what intelligence really is – and even wrote a post about it, comparing intelligence with “pattern recognition“.

Turns out, large language models completely fail at it. Even after a 50,000x scale-up, models like GPT-4.5 scored just ~10% on ARC1. For comparison, almost any random human would score over 95%.

Things started to shift in 2024 when researchers moved toward Test-Time Adaptation (TTA) – models that can actually adapt their behavior at inference time. OpenAI’s o3 was the first to show near-human performance on ARC1 using this approach. It was a huge milestone.

He argues that intelligence isn’t about how many tasks you can memorize – it’s about how efficiently you can use past experience to handle the unknown. It’s a process, not a skill list.

He breaks intelligence into two types of abstraction:

  • Type 1: value-centric, intuitive, pattern recognition (what deep learning is good at).
  • Type 2: program-centric, symbolic reasoning, step-by-step logic (what current models really struggle with).

Humans blend both seamlessly but AI doesn’t … yet.

His vision is a “programmer-like” meta-learner that:

  • Combines both types of abstraction
  • Adapts in real time
  • Builds and reuses an internal library of abstractions (like a cognitive GitHub)
  • And learns how to learn better over time

It’s an interesting idea – definitely a step up from just scaling static models. He is taking a step-by-step approach:

  • Starting with ARC1, which tests minimal fluid intelligence by presenting simple visual reasoning problems that can’t be solved through memorization.
  • Moving toward ARC2 and ARC3.

ARC2 raises the bar with more complex compositional reasoning tasks that test a model’s ability to combine abstract concepts in novel ways. Unlike ARC1, where some tasks could be solved at a glance, ARC2 requires deliberate, step-by-step thinking – yet it’s still solvable by regular humans with no prior training. AI models, however, perform near zero without test-time adaptation, making it a much more sensitive tool to evaluate real progress.

ARC3 goes even further, introducing dynamic, interactive environments. Instead of fixed input-output tasks, models must explore, discover goals, interpret controls, and learn the rules of an unfamiliar world – completely from scratch. Tasks are procedurally generated and built on core knowledge priors (like objectness, geometry, counting), and success is measured not just by solving the task, but by how efficiently it’s done – mirroring how humans navigate novelty with minimal trial and error. Each version builds on the last, aiming to more accurately measure and push progress toward true general intelligence.

Not so long ago, I also read about how large language models behave in the AI version of the classic strategy game Diplomacy. It’s fascinating – I think we’ll see many more tests like this soon. In that setup, models like O3, Gemini 2.5 Pro, Claude, DeepSeek R1, and LLaMA 4 Maverick compete using negotiation, alliance-building, and betrayal. Each shows distinct strategies.

It offers a new kind of benchmark – multi-agent, dynamic, open-ended. It probes how models handle real human-like behaviors: trust, negotiation, manipulation, adaptation. And it might push us even further toward the kind of general intelligence we actually care about.

I’m definitely keeping an eye on how this unfolds – it’s one of the most exciting developments in AI right now.


Posted

in

by

Tags: