The unreasonable effectiveness of stochastic parrots
The year is 2040 and the ARC Prize makes the following exciting announcement: "We are pleased to announce ARC-AGI 16, the next evolution of the ARC Prize. Current frontier models score 1.2% on this benchmark where humans easily score 100%". Granted, it's 2026 and they are only on their 3rd benchmark - but there is no sign of slowing down anytime soon. In fact, they expect new benchmarks annually.
Vision-language models routinely perform just as well without images as with them, or even better. They "attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images."
Take standard programming benchmarks - convert them to unseen programming languages and AI coding model performance collapses completely.
It's parrots all the way down
All of these are evidence that yes, indeed - AI models are simply not smart. They are extremely interesting pattern databases that can take seen patterns and unseen problems and propose a "pattern reasonable" extrapolation of the new, unseen input.
It's impossible to be aware of these findings and still think "sure, we're very, very close to AGI". This explains the push to redefine AGI from meaning - you know - actually intelligent, able to learn and function in unknown environments, to meaning "something that can do a lot of valuable things".
And that's the thing - this they absolutely can do: a lot of valuable work. The interesting thing about LLMs is not that they are not intelligent - but that what these pattern engines can do is so widely applicable. LLMs can do a formidable range of tasks.
Unreasonably effective systems
"Pattern reasonable" extrapolation makes the rewriting of inputs sound like less than it probably is. Consistently rewriting or expanding on an input is probably quite a universal algorithm - even if there are as yet no signs that it's how we work.
And in fact, running the AIs iteratively with useful world-testing capabilities proves to be the massive difference maker. The AI systems are vastly more powerful than the AI models - because of the system design around them. An example of this was discovered during the project on unseen programming languages I link to at the beginning.
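To make "running the AIs iteratively with world testing" concrete, here is a minimal sketch of such a loop. It is not the setup from the project mentioned above. Everything in it is a hypothetical stand-in: `run_loop`, `make_stub_model` and `check_add` are names I made up, and the "model" just replays canned guesses instead of calling a real LLM. The point is only the shape: propose, test against the world, feed the result back, repeat.

```python
from typing import Callable, Optional, Tuple

Feedback = Optional[str]


def run_loop(
    propose: Callable[[Feedback], str],
    check: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 5,
) -> Tuple[Optional[str], int]:
    """Ask for a candidate, test it, and feed the test result back in."""
    feedback: Feedback = None
    for round_number in range(1, max_rounds + 1):
        candidate = propose(feedback)
        passed, feedback = check(candidate)
        if passed:
            return candidate, round_number
    return None, max_rounds


# Hypothetical stand-in for a model: it replays canned guesses in order
# and ignores the feedback. A real system would pass the feedback to an LLM.
CANNED_GUESSES = [
    "def add(a, b): return a - b",
    "def add(a, b): return a * b",
    "def add(a, b): return a + b",
]


def make_stub_model() -> Callable[[Feedback], str]:
    guesses = iter(CANNED_GUESSES)

    def propose(feedback: Feedback) -> str:
        return next(guesses)

    return propose


def check_add(source: str) -> Tuple[bool, str]:
    """The 'world test': actually run the candidate and look at the result."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        result = namespace["add"](2, 3)
    except Exception as error:
        return False, f"error: {error!r}"
    if result == 5:
        return True, "ok"
    return False, f"add(2, 3) returned {result}, expected 5"


solution, rounds = run_loop(make_stub_model(), check_add)
print(rounds, solution)
```

Note where the work happens: the stub model here is as dumb as it gets, and the loop still lands on a working answer, because the checking and the retrying live in the system around it.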
So it's all just software in the end
Where exactly the models end and the systems begin is - I think - quite unknown at this point. Which, btw, should be a point of massive concern at the big model companies. There's no way they know that they are safe - that the AI systems that actually deliver the value rely on any particular model. It's in no way clear that you could not design way better systems that lean a lot less hard on specific models.
Another important factor here: the systems don't obey any scaling laws. They are 'just' software. Which is another way to say that they are as unlikely as all of 1956-2017 AI research to bring about general intelligence. You can keep re-programming them for specific workflows, but that's what you're doing - designing workflows for problems.