SOTA on the ARC-AGI Benchmark
Poetiq's systems establish entirely new Pareto frontiers on both ARC-AGI-1 and ARC-AGI-2
(Figures 1 and 2), surpassing previous results and pushing the boundary of what is possible in
cost-effective reasoning. We highlight a few interesting points, with emphasis given to our system’s
configuration using models released in the last week: GPT-5.1 on November 13, 2025, and Gemini
3 on November 18, 2025.
- Poetiq (Mix) used both the latest Gemini 3 and GPT-5.1 models. Compare with Gemini
3 Deep Think (Preview), which is significantly more expensive and has lower accuracy.
- Poetiq (Gemini-3-a,b,c) are examples of how Poetiq can leverage
multiple LLMs to maximize performance at any target cost. Poetiq discovered a
straightforward method to achieve Pareto-optimal solutions across a wide swath of
operating regimes by using multiple Gemini-3 calls to programmatically address these
problems (both on ARC-AGI-1 and ARC-AGI-2). We have open-sourced the code for these systems.
- Poetiq (Grok-4-Fast) emphasizes cost and is built on top of the Grok 4 Fast Reasoning
model. In fact, it is both cheaper and more accurate than the underlying model’s reported
numbers (see below for more details). It achieves accuracy rivaling models that are over two
orders of magnitude more expensive.
- Poetiq (GPT-OSS-b) is built on top of the open-weights GPT-OSS-120B model and shows
remarkable accuracy for less than 1 cent per problem (Figure 1).
- Poetiq (GPT-OSS-a) is built on top of the GPT-OSS-120B low thinking model. This point
is included to show system performance at extreme cost savings levels (Figure 1).
All these points (and more), while being capable separate systems in their own right, are
produced by the underlying, flexible Poetiq meta-system. One of the meta-system’s core
strengths is automatically selecting combinations of models and approaches, even
deciding when to write code and which models to assign coding tasks to. Our recursive,
self-improving system is LLM-agnostic and demonstrates its abilities with
state-of-the-art models.
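For readers less familiar with the term, a Pareto frontier over cost and accuracy is the set of systems that no other system beats on both axes at once. The sketch below shows one common way to compute it. The `Point` type, the `pareto_frontier` helper, and every number in `points` are made up for illustration; they are not our measured results.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost: float      # dollars per problem (illustrative values only)
    accuracy: float  # fraction of problems solved (illustrative values only)


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep each point that is more accurate than every point costing the same or less."""
    frontier: list[Point] = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; on equal cost, try the more accurate point first.
    for p in sorted(points, key=lambda p: (p.cost, -p.accuracy)):
        if p.accuracy > best_accuracy:
            frontier.append(p)
            best_accuracy = p.accuracy
    return frontier


# Hypothetical systems, not real benchmark numbers.
points = [
    Point("system_a", 0.01, 0.30),
    Point("system_b", 0.10, 0.25),   # dominated by system_a (cheaper and more accurate)
    Point("system_c", 0.50, 0.60),
    Point("system_d", 5.00, 0.55),   # dominated by system_c
    Point("system_e", 10.00, 0.80),
]

frontier = pareto_frontier(points)
print([p.name for p in frontier])  # ['system_a', 'system_c', 'system_e']
```

A system "establishes a new frontier" when adding its point to the list pushes previously undominated points off the result.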
Four observations:
- Poetiq (Gemini-3-b) saturates performance on ARC-AGI-1; allowing a larger
computation expenditure, as in Poetiq (Gemini-3-c), did not provide further benefit. However,
on ARC-AGI-2, performance continues to improve.
- All of the Poetiq meta-system’s adaptation was done prior
to the release of Gemini 3 and GPT-5.1. Additionally, it was never shown problems from
ARC-AGI-2. Further, for cost efficiency, the Poetiq system only relied on open-source models
for adaptation. The results from that adaptation (the basis for all of the systems shown)
were then used on both ARC-AGI-1 & 2, and also with over a dozen different underlying LLM
models (shown below in Figure 3). This indicates substantial transference and generalization
in the results of Poetiq’s system across model versions, families, and sizes. We have observed
this type of generalization on other problems as well.
- Our ARC-AGI-2 results have exceeded the performance of the average human test-taker (60%).
- As described below (see final section), most of the underlying LLMs suffer
varying degrees of performance degradation when moving from Public to Semi-Private
evaluation on ARC-AGI-1, and we expect the same for our systems. Most models have seen a
smaller difference in performance on ARC-AGI-2, as the sets are more closely calibrated. All results
reported here, for our work and everyone else’s, are on the public evaluation sets. See our analysis below.
Using Poetiq’s Meta-System to Improve Performance of Popular Models
To further illustrate the benefits of Poetiq’s meta-system we apply our technique to popular
recent models from Google DeepMind, OpenAI, Anthropic, and xAI. In each case, our system
improves the accuracy while reducing the cost.
How is this even possible? Our systems achieve this because they make only a single attempt
that uses fewer than two requests on average, rather than the
two attempts that ARC-AGI permits. Figure 3 shows this on ARC-AGI-1 for 12 models from
a variety of model families: GPT, Claude Haiku, Gemini, Grok 4, and GPT-OSS.
Figure 3: Poetiq’s systems improve existing models across model sizes and families (with
open or closed weights), in terms of both accuracy and cost.
It’s LLMs all the way down. We used LLMs to build, improve, and power the system. This
flexible, powerful, and recursive architecture is what allowed our small team to rapidly achieve
this suite of state-of-the-art results.
The specific configurations that we are open-sourcing were chosen to illustrate two key
principles:
- The prompt is an interface, not the intelligence: Our system engages in an iterative
problem-solving loop. It doesn't just ask a single question; it uses the LLM to generate a
potential solution (sometimes code, as in this example), receives feedback, analyzes the feedback,
and then uses the LLM again to refine it. This multi-step, self-improving process allows us
to incrementally build and perfect the answer.
- Self-Auditing: The system autonomously audits its own progress. It decides for itself
when it has enough information and the solution is satisfactory, allowing it to terminate
the process. This self-monitoring is critical for avoiding wasteful computation and minimizing
costs.
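The two principles above can be sketched as a single loop. This is a toy illustration under stated assumptions, not the code we are releasing: `run_candidate`, `solve`, `make_stub_llm`, the prompt wording, and the `max_requests` budget are all hypothetical, and the "LLM" is a stub that returns canned answers so the example runs offline.

```python
from typing import Callable, Optional

Grid = list[list[int]]
Example = tuple[Grid, Grid]  # (input grid, expected output grid)


def run_candidate(source: str, examples: list[Example]) -> list[str]:
    """Run candidate code on the examples; return failure messages (empty means all pass)."""
    namespace: dict = {}
    try:
        # Assumption: unsandboxed exec is acceptable for a toy; real model output needs isolation.
        exec(source, namespace)
        transform = namespace["transform"]
    except Exception as err:
        return [f"candidate did not load: {err!r}"]
    failures = []
    for i, (grid, expected) in enumerate(examples):
        try:
            actual = transform(grid)
        except Exception as err:
            failures.append(f"example {i}: raised {err!r}")
            continue
        if actual != expected:
            failures.append(f"example {i}: expected {expected}, got {actual}")
    return failures


def solve(
    llm: Callable[[str], str], examples: list[Example], max_requests: int = 4
) -> tuple[Optional[str], int]:
    """Generate, test, and refine until the audit passes or the request budget runs out."""
    prompt = "Write transform(grid) that maps each input to its output."
    for request in range(1, max_requests + 1):
        source = llm(prompt)
        failures = run_candidate(source, examples)
        if not failures:
            # Self-audit: every known example passes, so stop spending requests.
            return source, request
        # Feed the concrete failures back so the next attempt can refine the code.
        prompt = "Previous attempt failed:\n" + "\n".join(failures) + "\nRevise transform(grid)."
    return None, max_requests


def make_stub_llm() -> Callable[[str], str]:
    """Hypothetical stand-in for a model: the first answer is wrong, the second is right."""
    answers = iter([
        "def transform(grid):\n    return grid",
        "def transform(grid):\n    return [row[::-1] for row in grid]",
    ])
    return lambda prompt: next(answers)


examples = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]  # toy task: mirror each row
source, requests_used = solve(make_stub_llm(), examples)
print(requests_used)  # 2
```

The early return is what keeps the average number of requests low: easy problems stop after the first passing candidate, and only hard ones consume the full budget.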
We hope that our open-source code will help inspire new ideas and accelerate the path to
superintelligence. The official code is available on GitHub.
ARC-AGI provides an ideal test bed to firmly establish one of our core tenets – LLMs contain
much of humanity’s knowledge, but often struggle with tasks that rely on more complex
reasoning. While the performance of an LLM relies heavily on the query, its inherent
stochasticity makes knowledge extraction unreliable and makes the reasoning steps
unpredictable. The challenge lies in discovering a reasoning strategy that can both find the
necessary pieces of information and assemble them when they are discovered to intelligently
determine what information is needed next.
At Poetiq, automating and optimizing this process is one of our key goals. We are building
technology to optimize the extraction of this fragmented knowledge for complex reasoning
tasks by discovering, rather than dictating a priori, appropriate reasoning strategies
that are both adaptive to the underlying LLM and work within specified real-world
constraints (budgets, tokens, or compute). This will unlock the rapid progress in AI that
the technology promises. Our system is designed to very quickly adapt to the specifics of
the task and the model. ARC-AGI provides a concrete demonstration of this. For ARC-AGI, our
system discovered an elegant method to improve performance across the entire frontier!
At Poetiq, our core meta-system produces optimized agents to automate the extraction of
knowledge for hard tasks that require complex reasoning. We optimize every part of the
process: developing better strategies for determining what to ask, refining sequential
chains of questions, and devising fundamentally new methods for assembling the answers. ARC-AGI
is just the beginning – we’ve tackled several other benchmarks as well, with similarly
compelling results. Watch this space for more information on those, as well as other fun
demonstrations of our capabilities.
Poetiq is a lean, deeply technical team of 6 researchers and engineers with a combined 53
years of experience from Google DeepMind. We're focused on solving the fundamental problems
of AI reasoning and knowledge extraction in the presence of noise and uncertainty. Want to
join us?
Check out our open positions.
We're excited to share this result with the community and look forward to the discussion.
Let us know your thoughts at poetiq@poetiq.ai.