SOTA on the ARC-AGI Benchmark
Poetiq's systems establish entirely new Pareto frontiers on both ARC-AGI-1 and ARC-AGI-2
(Figures 1 and 2), surpassing previous results and pushing the boundary of what is
possible in cost-effective reasoning. We highlight a few interesting points, with
emphasis on the configurations of our system that use models released in the last week:
GPT-5.1 (November 13, 2025) and Gemini 3 (November 18, 2025).
-
Poetiq (Mix) uses both of the latest models, Gemini 3 and GPT-5.1. Compare it with
Gemini 3 Deep Think (Preview), which is significantly more expensive and has lower
accuracy.
-
Poetiq (Gemini-3-a,b,c) are examples of how Poetiq can leverage
multiple LLMs to maximize performance at any target cost. Poetiq discovered a
straightforward method to achieve Pareto-optimal solutions across a wide swath of
operating regimes by using multiple Gemini 3 calls to programmatically address these
problems (on both ARC-AGI-1 and ARC-AGI-2). We have open-sourced the code for these
systems.
-
Poetiq (Grok-4-Fast) emphasizes cost and is built on top of the Grok 4 Fast
Reasoning model. In fact, it is both cheaper and more accurate than the
underlying model’s reported numbers (see below for more details). It achieves
accuracy rivaling models that are over two orders of magnitude more expensive.
-
Poetiq (GPT-OSS-b) is built on top of the open-weights GPT-OSS-120B model and
shows remarkable accuracy for less than 1 cent per problem (Figure 1).
-
Poetiq (GPT-OSS-a) is built on top of the GPT-OSS-120B low-thinking model.
This point is included to show system performance at extreme levels of cost savings
(Figure 1).
All these points (and more), while capable systems in their own right, are produced by
the underlying, flexible Poetiq meta-system. One of the meta-system’s core strengths is
automatically selecting combinations of models and approaches, including deciding when
to write code and which models to assign coding tasks to. Our recursive, self-improving
system is LLM-agnostic and demonstrates its abilities with state-of-the-art models.
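For readers unfamiliar with the term, a system sits on the Pareto frontier when no other system is at least as cheap and at least as accurate while being strictly better on one of the two. The sketch below only illustrates that definition: the `Point` type, the `pareto_frontier` helper, and every number in it are hypothetical and are not taken from Poetiq's code or results.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost: float      # dollars per problem (lower is better)
    accuracy: float  # fraction of problems solved (higher is better)


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the non-dominated points, ordered from cheapest to most expensive."""
    frontier: list[Point] = []
    best_accuracy = float("-inf")
    # Sort by ascending cost; break cost ties by descending accuracy.
    for p in sorted(points, key=lambda p: (p.cost, -p.accuracy)):
        # A point survives only if it beats every cheaper-or-equal point on accuracy.
        if p.accuracy > best_accuracy:
            frontier.append(p)
            best_accuracy = p.accuracy
    return frontier


# Made-up numbers for illustration only; these are NOT benchmark results.
systems = [
    Point("A", cost=0.01, accuracy=0.40),
    Point("B", cost=0.10, accuracy=0.35),  # dominated by A
    Point("C", cost=0.50, accuracy=0.70),
    Point("D", cost=2.00, accuracy=0.70),  # dominated by C
    Point("E", cost=5.00, accuracy=0.90),
]
frontier_names = [p.name for p in pareto_frontier(systems)]
print(frontier_names)  # ['A', 'C', 'E']
```

Reading a cost-versus-accuracy figure this way, a new frontier means that for each budget the new systems reach an accuracy that no earlier point matched at that cost or less.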
Four observations:
-
Poetiq (Gemini-3-b) saturates performance on ARC-AGI-1: the larger computation budget
of Poetiq (Gemini-3-c) provided no further benefit. On ARC-AGI-2, however, performance
continues to improve.
-
All adaptation of Poetiq’s meta-system was done prior to the release of Gemini 3 and
GPT-5.1. Additionally, the meta-system was never shown problems from ARC-AGI-2.
Further, for cost efficiency, the Poetiq system relied only on open-source models for
adaptation. The results of that adaptation (the basis for all of the systems shown)
were then used on both ARC-AGI-1 and ARC-AGI-2, and with over a dozen different
underlying LLMs (shown below in Figure 3). This indicates substantial transfer and
generalization of the results of Poetiq’s system across model versions, families, and
sizes. We have observed this type of generalization on other problems as well.
-
Our ARC-AGI-2 results have exceeded the performance of the average human test-taker
(60%).
-
As described below (see the final section), most of the underlying LLMs suffer
varying degrees of performance degradation when moving from the Public to the
Semi-Private evaluation set on ARC-AGI-1. We expect the same of our systems. Most
models have seen a smaller difference in performance on ARC-AGI-2, as its sets are more
closely calibrated. All results reported here, for our work and everyone else’s, are on
the public evaluation sets. See our analysis below.
Using Poetiq’s Meta-System to Improve Performance of Popular Models
To further illustrate the benefits of Poetiq’s meta-system, we apply our technique to
popular recent models from Google DeepMind, OpenAI, Anthropic, and xAI. In each case,
our system improves accuracy while reducing cost.
How is this even possible? Our systems achieve this because they make only a
single attempt that uses fewer than two requests on average, rather than the
two attempts that ARC-AGI permits. Figure 3 shows this on ARC-AGI-1 for 12 models
from a variety of model families: GPT, Claude Haiku, Gemini, Grok 4, and GPT-OSS.
Figure 3: Poetiq’s systems improve existing models across model sizes and families (with
open or closed weights), in terms of both accuracy and cost.
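To make the attempt budget mentioned above concrete, here is a minimal sketch of exact-match scoring for one test output under a configurable number of attempts. The `task_solved` helper and the toy grids are hypothetical illustrations, not the official ARC-AGI scoring code.

```python
Grid = list[list[int]]


def task_solved(attempts: list[Grid], expected: Grid, max_attempts: int = 2) -> bool:
    """Count a test output as solved if any permitted attempt matches it exactly."""
    # Attempts beyond the permitted budget are ignored.
    return any(attempt == expected for attempt in attempts[:max_attempts])


expected = [[0, 1], [1, 0]]

# A single-attempt system must get the grid right the first time.
single = task_solved([[[0, 1], [1, 0]]], expected, max_attempts=1)

# With two attempts, a wrong first guess can still be rescued by the second.
two_tries = task_solved([[[1, 1], [1, 0]], [[0, 1], [1, 0]]], expected)

# A correct third guess does not count under a two-attempt budget.
third_ignored = task_solved([[[1, 1]], [[0, 0]], [[0, 1], [1, 0]]], expected)

print(single, two_tries, third_ignored)  # True True False
```

A system that submits a single attempt gives up the second chance in exchange for fewer model requests, which is where the cost saving described above comes from.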
It’s LLMs all the way down. We used LLMs to build, improve, and power the system.
This flexible, powerful, and recursive architecture is what allowed our small team to
rapidly achieve this suite of state-of-the-art results.
The specific configurations that we are open-sourcing were chosen to illustrate
two key principles:
-
The prompt is an interface, not the intelligence: Our system engages in an
iterative problem-solving loop. It doesn't just ask a single question; it uses the
LLM to generate a potential solution (sometimes code, as in this example), receives
feedback, analyzes the feedback, and then uses the LLM again to refine the solution.
This multi-step, self-improving process allows us to incrementally build and perfect
the answer.
-
Self-Auditing: The system autonomously audits its own progress. It decides
for itself when it has enough information and the solution is satisfactory, allowing
it to terminate the process. This self-monitoring is critical for avoiding wasteful
computation and minimizing costs.
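The two principles above can be sketched as a small generate, check, refine loop. This is a minimal illustration under stated assumptions, not the open-sourced implementation: `check`, `solve`, and `stub_propose` are hypothetical names, the LLM call is replaced by a stub, and the stopping rule (every training example passes) is an assumed stand-in for the system's own self-audit.

```python
from typing import Callable, Optional

Grid = list[list[int]]
Example = tuple[Grid, Grid]  # (input grid, expected output grid)
Program = Callable[[Grid], Grid]
Proposer = Callable[[list[Example], list[str]], Program]


def check(candidate: Program, examples: list[Example]) -> list[str]:
    """Run a candidate on the training examples and describe each failure."""
    feedback: list[str] = []
    for i, (grid_in, expected) in enumerate(examples):
        try:
            got = candidate(grid_in)
        except Exception as exc:  # a crashing candidate is also useful feedback
            feedback.append(f"example {i}: raised {exc!r}")
            continue
        if got != expected:
            feedback.append(f"example {i}: expected {expected}, got {got}")
    return feedback


def solve(propose: Proposer, examples: list[Example],
          max_requests: int = 4) -> tuple[Optional[Program], int]:
    """Ask for a candidate, test it, and feed failures back until it passes."""
    feedback: list[str] = []
    for request in range(1, max_requests + 1):
        candidate = propose(examples, feedback)
        feedback = check(candidate, examples)
        if not feedback:  # assumed self-audit: all examples pass, so stop early
            return candidate, request
    return None, max_requests  # budget exhausted without a passing candidate


def stub_propose(examples: list[Example], feedback: list[str]) -> Program:
    # Stand-in for an LLM call (hypothetical): the first guess is the identity
    # transform; once it sees failure feedback it "refines" to mirroring rows.
    if not feedback:
        return lambda g: [row[:] for row in g]
    return lambda g: [row[::-1] for row in g]


train = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
solution, requests_used = solve(stub_propose, train)
print(requests_used)          # 2
print(solution([[5, 6, 7]]))  # [[7, 6, 5]]
```

Stopping as soon as the audit passes is what keeps the number of requests, and therefore the cost, low; a real system would replace the stub with model calls and could use a richer audit than exact agreement on the training examples.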
We hope that our open-source code will help inspire new ideas and accelerate the path to
superintelligence. The official code is available on GitHub.
ARC-AGI provides an ideal test bed to firmly establish one of our core tenets: LLMs
contain much of humanity’s knowledge, but often struggle with tasks that rely on more
complex reasoning. While the performance of an LLM relies heavily on the query, its
inherent stochasticity makes knowledge extraction unreliable and the reasoning steps
unpredictable. The challenge lies in discovering a reasoning strategy that can both find
the necessary pieces of information and, as they are discovered, assemble them to
intelligently determine what information is needed next.
At Poetiq, automating and optimizing this process is one of our key goals. We are
building technology to optimize the extraction of this fragmented knowledge for complex
reasoning tasks by not a priori dictating, but rather discovering, appropriate reasoning
strategies that are both adaptive to the underlying LLM and work within specified
real-world constraints (budgets, tokens, or compute). This will unlock the rapid
progress in AI that the technology promises. Our system is designed to very quickly
adapt to the specifics of the task and the model. ARC-AGI provides a concrete
demonstration of this. For ARC-AGI, our system discovered an elegant method to improve
performance across the entire frontier!
At Poetiq, our core meta-system produces optimized agents to automate the extraction of
knowledge for hard tasks that require complex reasoning. We optimize every part of the
process: developing better strategies for determining what to ask, refining sequential
chains of questions, and devising fundamentally new methods for assembling the answers.
ARC-AGI is just the beginning: we’ve tackled several other benchmarks as well, with
similarly compelling results. Watch this space for more information on those, as well as
other fun demonstrations of our capabilities.
Poetiq is a lean, deeply technical team of 6 researchers and engineers with a combined
53 years of experience from Google DeepMind. We're focused on solving the fundamental
problems of AI reasoning and knowledge extraction in the presence of noise and
uncertainty. Want to join us?
Check out our open positions.
We're excited to share this result with the community and look forward to the
discussion. Let us know your thoughts at poetiq@poetiq.ai.