OpenAI introduced its reasoning-oriented o3 family of artificial intelligence (AI) models last month. In a live broadcast, the company revealed benchmark results for the model drawn from internal evaluations. All the shared scores were remarkable and underscored the enhanced abilities of o1's successor, but one was particularly noteworthy: on the ARC-AGI benchmark, the large language model (LLM) scored 85 percent, beating the previous best by roughly 30 percentage points. Interestingly, this score matches the average performance of a human on the same test.
OpenAI Achieves 85 Percent on ARC-AGI Benchmark
However, does o3’s impressive score on the test indicate that its intelligence equates to that of an average human? This question would be simpler to address if the AI model were publicly available for testing. Since OpenAI has yet to reveal any information about the model’s architecture, training methods, or datasets, drawing definitive conclusions is challenging.
Still, certain known aspects of the company’s reasoning-centric models can help set expectations for OpenAI’s forthcoming LLM. First, the o-series models so far have not undergone significant changes in architecture or framework; instead, they have been fine-tuned to demonstrate improved capabilities.
For example, developers employed a technique called test-time compute with the o1 series of AI models. Under this approach, a model is allotted extra processing time to work through a question, along with a workspace to test hypotheses and correct its own errors. Similarly, GPT-4o was essentially a refined version of GPT-4.
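One common form of test-time compute is best-of-N sampling: spend more inference-time effort by drawing several candidate answers and keeping the one a verifier scores highest. OpenAI has not disclosed o1's exact mechanism, so the sketch below is only an illustration of the general idea; the solver, scorer, and toy data are all invented for the example.

```python
def solve_with_extra_compute(question, sample_answer, score, n_attempts=8):
    """Best-of-N test-time compute: draw several candidate answers for the
    same question and return the one the scorer rates highest."""
    attempts = [sample_answer(question, i) for i in range(n_attempts)]
    return max(attempts, key=lambda a: score(question, a))

# Toy demonstration: a flaky "solver" whose guesses for 2 + 2 drift
# around the true value, and a scorer that prefers answers closer to 4.
guesses = [3, 5, 4, 6, 2, 4, 5, 3]
flaky_solver = lambda question, i: guesses[i]
closeness = lambda question, answer: -abs(answer - 4)

print(solve_with_extra_compute("2 + 2", flaky_solver, closeness))  # prints 4
```

Even though most individual guesses are wrong, spending compute on eight attempts and filtering with a scorer recovers the correct answer; this is the intuition behind giving a model "more time to think."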
It seems improbable that the firm would have implemented substantial modifications to the architecture with the o3 model, considering it is also rumored to be developing the GPT-5 AI model, which could launch later this year.
Turning to the ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) benchmark: it comprises a series of grid-based pattern-recognition puzzles that require reasoning and spatial comprehension to solve. One plausible route to a high score is training on a large volume of high-quality data focused on reasoning and aptitude-style logic.
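To make the benchmark's format concrete, each ARC-style task shows a few input-to-output grid pairs, and the solver must infer the underlying transformation and apply it to a fresh grid. The tiny example below uses an invented rule (mirror the grid left to right) purely for illustration; real ARC tasks span a much wider range of transformations.

```python
# A miniature ARC-style puzzle: infer the rule from one training pair,
# then apply it to an unseen test grid. The mirror rule is a made-up
# stand-in for the varied transformations in the real benchmark.
def mirror(grid):
    """Reverse each row, i.e. flip the grid left-to-right."""
    return [row[::-1] for row in grid]

train_input  = [[1, 0], [2, 0]]
train_output = [[0, 1], [0, 2]]
test_input   = [[3, 0, 0], [0, 4, 0]]

assert mirror(train_input) == train_output  # the rule fits the example
print(mirror(test_input))  # prints [[0, 0, 3], [0, 4, 0]]
```

The difficulty of ARC lies not in applying a known rule but in inferring it from only a handful of examples, which is why the benchmark is treated as a test of abstract reasoning rather than memorization.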
Nonetheless, if it were that straightforward, earlier AI models would have achieved high scores on the test as well. Notably, the previous best score was 55 percent, compared with o3’s 85 percent. This suggests the developers have integrated new refinement techniques and algorithms to improve the model’s reasoning capabilities, though the full scope of these improvements cannot be outlined until OpenAI officially discloses the technical details.
That said, it is improbable that the o3 AI model has attained AGI or human-level intelligence. First, if that were true, it would mark the end of the company’s partnership with Microsoft, which is set to conclude once OpenAI’s models achieve AGI status. Second, numerous AI specialists, including Geoffrey Hinton, often called the godfather of AI, have consistently pointed out that we are several years away from achieving AGI.
Finally, AGI represents such a significant achievement that if OpenAI had attained it, the company would clearly inform the public rather than drop subtle hints. What is far more probable is that OpenAI has found a way to enhance o3’s pattern-based reasoning abilities (either by adding sufficient sampling data or by adjusting its training techniques), as a PTI report also noted.