
OpenAI’s o3 shows notable progress on ARC-AGI, sparking debate over AI reasoning




OpenAI’s latest o3 model has achieved a breakthrough that has surprised the AI research community. o3 scored a record 75.7% on the notoriously difficult ARC-AGI benchmark under standard compute conditions, and a high-compute version reached 87.5%.

While the achievement on ARC-AGI is impressive, it does not prove that the code for artificial general intelligence (AGI) has been cracked.

Abstraction and Reasoning Corpus

The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus (ARC), which tests an AI system’s ability to adapt to novel tasks and demonstrate fluid intelligence. ARC is made up of a set of visual puzzles that require understanding basic concepts such as objects, boundaries, and spatial relationships. While humans can easily solve ARC puzzles with very few demonstrations, current AI systems struggle with them. ARC has long been considered one of the most challenging benchmarks in AI.

ARC puzzle example (source: arcprize.org)
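
To give a sense of the format: each ARC task is distributed as a small JSON structure of paired input/output grids, where every cell is an integer color code. The toy task below is a hypothetical illustration of that structure, not a real puzzle from the benchmark.

```python
# A minimal, hypothetical ARC-style task. Each grid is a list of rows and
# each cell is an integer color code (0-9). Real tasks ship as JSON files
# with "train" and "test" lists of input/output pairs.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # the solver must predict the output
    ],
}

# A handful of demonstrations is all the solver gets for each task.
for pair in task["train"]:
    print(pair["input"], "->", pair["output"])
```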

ARC has been designed so that it cannot be fooled by training models with millions of examples in the hope of covering all possible puzzle combinations.

The benchmark consists of a public training set of 400 simple examples, complemented by a public evaluation set of 400 more challenging puzzles that test the generalization of AI systems. The ARC-AGI Challenge also includes private and semi-private test sets of 100 puzzles each, which are not shared with the public. These are used to evaluate candidate AI systems without the risk of leaking the data and contaminating future systems with prior knowledge. Additionally, the competition caps the amount of compute participants can use, to ensure the puzzles are not solved through brute-force methods.

A breakthrough in solving novel tasks

o1-preview and o1 scored a maximum of 32% on ARC-AGI. Researcher Jeremy Berman achieved the highest score before o3, 53%, using a hybrid approach that combined Claude 3.5 Sonnet with genetic algorithms and a code interpreter.

In a blog post, François Chollet, the creator of ARC, described o3’s performance as “a striking and significant step-function increase in AI capabilities, showing a novel task adaptation capability never before seen in GPT-family models.”

It is important to note that throwing more compute at previous generations of models could not achieve these results. For context, it took four years for models to progress from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. While we don’t know much about o3’s architecture, we can be fairly confident that it is not orders of magnitude larger than its predecessors.

Performance of different models in ARC-AGI (source: arcprize.org)

“This is not simply an incremental improvement, but a genuine advance, marking a qualitative shift in AI capabilities compared to the previous limitations of LLMs,” Chollet wrote. “o3 is a system capable of adapting to tasks it has never encountered before, possibly approaching human-level performance in the ARC-AGI domain.”

It is worth noting that o3’s performance on ARC-AGI comes at a steep cost. In the low-compute setting, the model spends between $17 and $20 and 33 million tokens to solve each puzzle, while in the high-compute setting it uses roughly 172 times more compute and billions of tokens per problem. However, as the cost of inference continues to fall, we can expect these figures to become more reasonable.
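
Taking the reported figures at face value, the scale of the high-compute runs is straightforward to estimate. The back-of-the-envelope sketch below assumes cost scales roughly linearly with compute, which is a simplification.

```python
# Back-of-the-envelope estimate from the figures quoted above.
low_tokens_per_task = 33_000_000      # ~33M tokens per puzzle, low-compute
low_cost_per_task = (17 + 20) / 2     # ~$17-20 per puzzle, low-compute
compute_multiplier = 172              # high-compute uses ~172x more compute

high_tokens = low_tokens_per_task * compute_multiplier
print(f"high-compute tokens per puzzle: ~{high_tokens / 1e9:.1f} billion")  # ~5.7

# If cost scales roughly with compute, each high-compute puzzle would run
# into the low thousands of dollars -- hence "billions of tokens per problem".
print(f"implied high-compute cost per puzzle: ~${low_cost_per_task * compute_multiplier:,.0f}")
```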

A new paradigm in LLM reasoning?

The key to solving novel problems is what Chollet and other scientists call “program synthesis.” A thinking system should be able to develop small programs for solving very specific problems, then combine those programs to tackle more complex problems. Classical language models have absorbed a great deal of knowledge and contain a rich set of internal programs, but they lack compositionality, which prevents them from solving puzzles that lie outside their training distribution.
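
To make the idea concrete, here is a minimal, hypothetical sketch of program synthesis over ARC-style grids: a tiny library of primitive transformations is searched for a composition that reproduces every training pair, and the winning program is then applied to a new input. This toy enumerator illustrates the concept only; it says nothing about how o3 actually works.

```python
from itertools import product

# Primitive "programs": tiny grid transformations that can be composed.
def rotate(grid):  return [list(row) for row in zip(*grid[::-1])]
def flip(grid):    return [row[::-1] for row in grid]
def invert(grid):  return [[1 - c if c in (0, 1) else c for c in row] for row in grid]

PRIMITIVES = [rotate, flip, invert]

def run(program, grid):
    for step in program:
        grid = step(grid)
    return grid

def synthesize(train_pairs, max_depth=3):
    """Enumerate compositions of primitives until one fits all demonstrations."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, p["input"]) == p["output"] for p in train_pairs):
                return program
    return None

train = [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]]},
]
program = synthesize(train)
if program:
    print([f.__name__ for f in program])   # the synthesized program, e.g. ['invert']
    print(run(program, [[0, 0], [1, 1]]))  # applied to a novel test input
```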

Unfortunately, there is very little information about how o3 works under the hood, and here scientists’ opinions diverge. Chollet speculates that o3 uses a type of program synthesis that combines chain-of-thought (CoT) reasoning with a search mechanism and a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring in recent months.
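
Rendered as pseudocode, Chollet’s hypothesis would look something like a best-of-N search guided by a reward model. Both generate_cot and reward_model below are hypothetical stand-ins; nothing about o3’s actual internals is public.

```python
import random

# Hypothetical stand-ins: in a real system, generate_cot would sample a
# chain of thought from a language model, and reward_model would be a
# learned verifier scoring it.
def generate_cot(task: str) -> str:
    return f"candidate reasoning chain {random.random():.3f} for {task}"

def reward_model(task: str, chain: str) -> float:
    return random.random()  # placeholder score in [0, 1]

def best_of_n(task: str, n: int = 16) -> str:
    """Sample n chains of thought and keep the one the reward model scores
    highest -- the simplest form of "search plus reward model"."""
    candidates = [generate_cot(task) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(task, c))

print(best_of_n("arc-task-001"))
```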

Other scientists, such as Nathan Lambert of the Allen Institute for AI, suggest that “o1 and o3 can actually be just the forward passes from one language model.” On the day o3 was announced, OpenAI researcher Nat McAleese posted on X that o1 was “just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1.”

On the same day, Denny Zhou of Google DeepMind’s reasoning team called the combination of search and current reinforcement learning approaches a “dead end.”

“The most beautiful thing about LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g. mcts) over the generation space, whether by a well-finetuned model or a carefully designed prompt,” he posted on X.

While the details of o3’s reasoning may seem trivial next to the breakthrough on ARC-AGI, they could very well define the next paradigm shift in training LLMs. There is an ongoing debate over whether the laws of scaling LLMs through training data and compute have hit a wall. Whether test-time scaling depends on better training data or on different inference architectures may determine the next path forward.

Not AGI

The name ARC-AGI is misleading, and some have equated solving it with achieving AGI. However, Chollet emphasizes that “ARC-AGI is not a litmus test for AGI.”

“Passing ARC-AGI is not the same as achieving AGI, and in fact I don’t think o3 is AGI yet,” he writes. “o3 still fails at some very easy tasks, indicating fundamental differences with human intelligence.”

Furthermore, he notes that o3 cannot learn these skills on its own and relies on external verifiers during inference and on human-labeled reasoning chains during training.

Other scientists have pointed out flaws in the results reported by OpenAI. For example, the model was fine-tuned on the ARC training set to achieve its state-of-the-art results. “The solver should not need much task-specific ‘training,’ either on the domain itself or on each specific task,” writes scientist Melanie Mitchell.

To verify whether these models possess the kind of abstraction and reasoning the ARC benchmark was created to measure, Mitchell proposes “to see if these systems can adapt to variants on specific tasks or to reasoning tasks using the same concepts, but in domains other than ARC.”

Chollet and his team are currently working on a new benchmark that challenges o3 and could reduce its score to less than 30% even with a high compute budget, while humans would be able to solve 95% of its puzzles without any training.

“You’ll know AGI is here when the exercise of creating tasks that are easy for ordinary humans but difficult for AI becomes simply impossible,” Chollet writes.


