Software 2.5, Software 3.0 and Software 3.5

A couple of weeks ago Andrej Karpathy generated a ton of buzz with a talk where he presented the idea of Software 3.0 - describing LLMs as the next revolution of computers and programming itself - the transition from programming language-driven machines to natural language-driven machines. The software 1.0 inbetween is the age of AIs that are not instructed but trained.

Karpathy does a great job setting up these big eras - but reflecting on the current state of AI and AI history I think we can suggest a litte more detail and further developments.

There was a time when Everyone Knew - that to do true AI we needed to figure out how to make computers reason (whatever that meant then and means now) over symbols - inside a thinking machine there would be some kind of world model with syllogisms and laws that the robot knows we should observe - in other words - intelligence, in this thinking is certainly not just autoregressive word salad however tasty the word salad is.

This was the era Prolog, Cyc and similar ideas. You still find this core thinking in AI critiques like that of Gary Marcus. It's an age-old conflict. It has played our for decades on a small scale in the field of speech recognition - as the battle between linguists with syntax models and engineers using a purely statistical/information theoretical approach. A famous quip from Fred Jelinek shows the fierceness of the fight: "Every time I fire a linguist, the performance of the speech recognizer goes up"

The problem of certainty remains, though. The models are not self aware and consequently also not aware of the mistakes they make and this basic insuffiency shows up all over the place.

Software 2.5

There's been a couple of approaches to dealing with it. A first approach - we can call this software 2.5, software that's not quite 3.0 - is to instruct the LLMs to build a programmed solution for the problem. LLMs are great at building code - and programming languages offer us an ability to add a little rigor and structure to solutions. We can require the production to be testable as an example - demand LLM-derived programs that have a certain interface and so on.

Software 3.0

The current approach in software 3.0 is reasoning models - instead of adding rigor and structure with code - add rigor by playing out the initial answer as language and basically feeding this approach into the model itself for further production. This also adds depth since some of the structure can be planning and so on. This works remarkable well and has led to the most recent revolutions in performance just last week - with both OpenAI and Google reporting progress in math reasoning - by achieving the crazy feat of gold in the math olympics. Of special relevance here is a technical improvement over Google's previous math olympics results: Previously the team had translated the problems from natural language to a formal programming language for theorem proving and worked from that. This time around they did not. To paraphrase, they fired a linguist and the results improved.

Software 3.5

But this is of course math... a highly controlled and closed domain.. very, very different from the real world. For one thing - there's a grading at the end and that's the only way we know this even worked. There's no such thing in the real world.

My gut feeling is that we'll see a reemergence of the old symbolic AI before we're done with this revolution... but in the form of an upgrade of the neural reasoning models from running on text tokens to running on bytecodes - the symbols will come back - as improved software on the 3.0 stack - software 3.5. The reason I think this will happen is that the future AIs need to be able to replay and verify the answers - and that this needs to be a lot more fluid than running tools to evaluate programs.

There are engineering reasons we need an upgrade also - the pure text models are inherently unsafe - offering close to no boundaries between data and input - making them extremely vulnerable to injection and other attacks. This vulnerability is a big technical problem in deploying AI in any interactive but important domain - but it is also a huge weakness in terms of intelligence. A truly intelligent bot must understand the difference between 'self' - the bytecode or core generation - and 'other' - the inputs.

I don't see this as possible at all in a 'pure text' universe - 3.0 will need an upgrade to 3.5.