SOTA on the ARC-AGI Benchmark
Poetiq's systems establish entirely new Pareto frontiers on both ARC-AGI-1 and ARC-AGI-2
(Figures 1 and 2), surpassing previous results and pushing the boundary for what is possible in
cost-effective reasoning. We highlight a few interesting points, with emphasis on the configurations
of our system that use models released in the last week: GPT-5.1 on November 13, 2025, and Gemini 3
on November 18, 2025.
- Poetiq (Mix) uses both of the latest models, Gemini 3 and GPT-5.1. Compare it with Gemini 3
Deep Think (Preview), which is significantly more expensive and has lower accuracy.
- Poetiq (Gemini-3-a,b,c) are examples of how Poetiq can leverage multiple LLMs to
maximize performance at any target cost. Poetiq discovered a straightforward method
to achieve Pareto-optimal solutions across a wide swath of operating regimes by using
multiple Gemini-3 calls to programmatically address these problems (on both ARC-AGI-1 and
ARC-AGI-2). We have open-sourced the code for these systems.
- Poetiq (Grok-4-Fast) emphasizes cost and is built on top of the Grok 4 Fast Reasoning
model. In fact, it is both cheaper and more accurate than the underlying model’s
reported numbers (see below for more details). It achieves accuracy rivaling models that are
over two orders of magnitude more expensive.
- Poetiq (GPT-OSS-b) is built on top of the open-weights GPT-OSS-120B model and shows
remarkable accuracy for less than 1 cent per problem (Figure 1).
- Poetiq (GPT-OSS-a) is built on top of the GPT-OSS-120B low-thinking model. This point
is included to show system performance at extreme levels of cost savings (Figure 1).
All these points (and more), while capable systems in their own right, are produced by the
underlying, flexible Poetiq meta-system. One of the meta-system’s core strengths is
automatically selecting combinations of models and approaches, even deciding when to write
code and which models to assign coding tasks to. Our recursive, self-improving system is
LLM-agnostic and demonstrates its abilities with state-of-the-art models.
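To make the terms "Pareto frontier" and "target cost" concrete, here is a minimal Python sketch of how a frontier can be computed from (cost, accuracy) points and how the best configuration under a budget can be picked from it. It is an illustration only, not Poetiq's code: the `Point` type, the function names, and every number below are hypothetical placeholders and do not correspond to any result in Figures 1 or 2.

```python
from typing import List, NamedTuple, Optional


class Point(NamedTuple):
    name: str
    cost: float      # dollars per problem (hypothetical values below)
    accuracy: float  # fraction of problems solved (hypothetical values below)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep the points that no other point beats on both cost and accuracy."""
    frontier: List[Point] = []
    best_accuracy = float("-inf")
    # Scan from cheapest to most expensive; at equal cost, try the most accurate first.
    for p in sorted(points, key=lambda p: (p.cost, -p.accuracy)):
        # A point survives only if it is more accurate than everything cheaper.
        if p.accuracy > best_accuracy:
            frontier.append(p)
            best_accuracy = p.accuracy
    return frontier


def best_under_budget(points: List[Point], budget: float) -> Optional[Point]:
    """Return the most accurate point whose cost fits the budget, if any."""
    affordable = [p for p in points if p.cost <= budget]
    return max(affordable, key=lambda p: p.accuracy) if affordable else None


# Placeholder data, invented purely for illustration.
points = [
    Point("A", 0.01, 0.30),
    Point("B", 0.10, 0.25),  # dominated by A: costs more, solves less
    Point("C", 0.50, 0.60),
    Point("D", 1.00, 0.55),  # dominated by C
    Point("E", 2.00, 0.80),
]

frontier_names = [p.name for p in pareto_frontier(points)]
print(frontier_names)                        # ['A', 'C', 'E']
print(best_under_budget(points, 0.60).name)  # C
```

A system "establishes a new frontier" in this sense when its points are not dominated by any previously reported point.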
Four observations:
- Poetiq (Gemini-3-b) is saturating performance on ARC-AGI-1: allowing a larger computation
budget, as in Poetiq (Gemini-3-c), did not provide any benefit. On ARC-AGI-2, however,
performance continues to improve.
- All of the adaptation of Poetiq’s meta-system was done prior to the release of Gemini 3 and
GPT-5.1. Additionally, it was never shown problems from ARC-AGI-2. Further, for cost
efficiency, the Poetiq system relied only on open-source models for adaptation. The results
of that adaptation (the basis for all of the systems shown) were then used on both ARC-AGI-1
and ARC-AGI-2, and with over a dozen different underlying LLMs (shown below in Figure 3).
This indicates substantial transference and generalization in the results of Poetiq’s system
across model versions, families, and sizes.
We have observed this type of generalization on other problems as well.
- Our ARC-AGI-2 results have exceeded the performance of the average human test-taker (60%).
- As described below (see final section), most of the underlying LLMs suffer varying
degrees of performance degradation when moving from the Public to the Semi-Private evaluation
set on ARC-AGI-1. We expect the same of our systems. Most models have seen a smaller difference
in performance on ARC-AGI-2, as its sets are more closely calibrated. All results reported here,
for our work and everyone else’s, are on the public evaluation sets. See our analysis below.
Using Poetiq’s Meta-System to Improve Performance of Popular Models
To further illustrate the benefits of Poetiq’s meta-system, we apply our technique to popular recent
models from Google DeepMind, OpenAI, Anthropic, and xAI. In each case, our system improves the
accuracy while reducing the cost. How is this even possible? Our systems achieve this because
they make only a single attempt that uses fewer than two requests on average, rather than the
two attempts that ARC-AGI permits. Figure 3 shows this on ARC-AGI-1 for 12 models from a variety of model
families: GPT, Claude Haiku, Gemini, Grok 4, and GPT-OSS.
Figure 3: Poetiq’s systems improve existing models across model sizes and families (with open or
closed weights), in terms of both accuracy and cost.
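The cost argument above comes down to simple arithmetic, sketched below in Python. This is a back-of-the-envelope illustration under loud assumptions: every request is assumed to cost the same as one baseline attempt, the baseline is assumed to always spend both permitted attempts, and the per-problem request counts are invented placeholders, not measurements from any Poetiq system.

```python
from typing import List


def mean_requests(requests_per_problem: List[int]) -> float:
    """Average number of LLM requests spent per problem."""
    return sum(requests_per_problem) / len(requests_per_problem)


def relative_cost(requests_per_problem: List[int], baseline_attempts: int = 2) -> float:
    """Cost relative to a baseline that spends `baseline_attempts` requests per problem.

    Assumes (hypothetically) that every request costs the same.
    """
    return mean_requests(requests_per_problem) / baseline_attempts


# Placeholder request counts for eight problems, invented for illustration.
counts = [1, 1, 2, 1, 3, 1, 2, 1]

print(mean_requests(counts))  # 1.5
print(relative_cost(counts))  # 0.75
```

Under these assumptions, any system that averages fewer than two requests per problem is cheaper than the two-attempt baseline, even if a few hard problems take three or more requests.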
It’s LLMs all the way down. We used LLMs to build, improve, and power the system. This flexible,
powerful, and recursive architecture is what allowed our small team to rapidly achieve this suite of
state-of-the-art results.
The specific configurations that we are open-sourcing were chosen to illustrate two key principles:
- The prompt is an interface, not the intelligence: Our system engages in an iterative
problem-solving loop. It doesn't just ask a single question; it uses the LLM to generate a
potential solution (sometimes code, as in this example), receives feedback, analyzes the
feedback, and then uses the LLM again to refine it. This multi-step, self-improving process
allows us to incrementally build and perfect the answer.
- Self-Auditing: The system autonomously audits its own progress. It decides for itself
when it has enough information and the solution is satisfactory, allowing it to terminate
the process. This self-monitoring is critical for avoiding wasteful computation and
minimizing costs.
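The two principles above can be sketched as a small generate, check, and refine loop. The Python below is a minimal sketch under stated assumptions, not the open-sourced implementation: `propose` stands in for an LLM call, `check`, `solve`, and `Result` are hypothetical names, and the self-audit is simplified to "stop as soon as the candidate reproduces every training example", whereas the real system may use other stopping criteria.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]          # (input grid, expected output grid)
Program = Callable[[Grid], Grid]     # a candidate solution expressed as code
# Hypothetical stand-in for an LLM call: sees the examples and the latest feedback.
Proposer = Callable[[List[Example], List[str]], Program]


@dataclass
class Result:
    program: Optional[Program]
    solved: bool
    requests: int  # number of proposer (LLM) calls spent


def check(program: Program, examples: List[Example]) -> List[str]:
    """Run a candidate on the training examples and collect failure messages."""
    failures = []
    for idx, (grid, expected) in enumerate(examples):
        try:
            actual = program(grid)
        except Exception as exc:  # a crashing candidate is feedback too
            failures.append(f"example {idx}: raised {exc!r}")
            continue
        if actual != expected:
            failures.append(f"example {idx}: expected {expected}, got {actual}")
    return failures


def solve(propose: Proposer, examples: List[Example], max_requests: int = 4) -> Result:
    """Iteratively propose, test, and refine until the audit passes or the budget ends."""
    feedback: List[str] = []
    program: Optional[Program] = None
    for request in range(1, max_requests + 1):
        program = propose(examples, feedback)  # generate or refine a candidate
        feedback = check(program, examples)    # gather feedback by running it
        if not feedback:                       # self-audit: stop early when satisfied
            return Result(program, True, request)
    return Result(program, False, max_requests)


def stub_propose(examples: List[Example], feedback: List[str]) -> Program:
    """Toy proposer: guesses the identity first, then a horizontal flip after feedback."""
    if not feedback:
        return lambda grid: grid
    return lambda grid: [list(reversed(row)) for row in grid]


examples = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
result = solve(stub_propose, examples)
print(result.solved, result.requests)  # True 2
```

Stopping at the first passing candidate is what keeps the request count low in this sketch: the loop spends a second request only because the first guess failed its own check.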
We hope that our open-source code will help inspire new ideas and accelerate the path to
superintelligence. The official code is available on GitHub.
ARC-AGI provides an ideal test bed to firmly establish one of our core tenets – LLMs contain much of
humanity’s knowledge, but often struggle with tasks that rely on more complex reasoning. While the
performance of an LLM relies heavily on the query, its inherent stochasticity makes knowledge
extraction unreliable and the reasoning steps unpredictable. The challenge lies in discovering
a reasoning strategy that can both find the necessary pieces of information and, as they are
discovered, assemble them to intelligently determine what information is needed next.
At Poetiq, automating and optimizing this process is one of our key goals. We are building
technology to optimize the extraction of this fragmented knowledge for complex reasoning tasks by
discovering, rather than dictating a priori, appropriate reasoning strategies that are both
adaptive to the underlying LLM and work within specified real-world constraints (budgets, tokens, or
compute). This will unlock the rapid progress in AI that the technology promises. Our system is
designed to very quickly adapt to the specifics of the task and the model. ARC-AGI provides a
concrete demonstration of this. For ARC-AGI, our system discovered an elegant method to improve
performance across the entire frontier!
At Poetiq, our core meta-system produces optimized agents to automate the extraction of knowledge
for hard tasks that require complex reasoning. We optimize every part of the process: developing
better strategies for determining what to ask, refining sequential chains of questions, and devising
fundamentally new methods for assembling the answers. ARC-AGI is just the beginning – we’ve tackled
several other benchmarks as well, with similarly compelling results. Watch this space for more
information on those, as well as other fun demonstrations of our capabilities.
Poetiq is a lean, deeply technical team of 6 researchers and engineers with a combined 53 years of
experience from Google DeepMind. We're focused on solving the fundamental problems of AI reasoning
and knowledge extraction in the presence of noise and uncertainty. Want to join us?
Check out our open positions.
We're excited to share this result with the community and look forward to the discussion. Let us
know your thoughts at poetiq@poetiq.ai.