Figure 1: The Poetiq harness achieves SOTA performance on HLE. The harness integrates models
from multiple model families, yielding multiple paths to new SOTA results. Below, we’ll show
similar improvements with single models.
One of the core tenets of Poetiq is that our systems can be deployed to orchestrate any model; we are not tied to a specific model or model family. We are happy to report improved performance across all models tested, and further gains from mixing models. As with our previous results on ARC-AGI, we have taken care that all baselines reflect the reported state of the art for every model we compare against, so the comparisons are fair and unbiased.
Humanity's Last Exam
Humanity's Last Exam (HLE) is
a comprehensive, multi-modal academic benchmark positioned at the forefront of human knowledge,
intended to be the conclusive closed-ended benchmark of its kind.
The New York Times commented: “When A.I. Passes This Test, Look Out”. We’re getting closer. Unlike our previous
results with ARC-AGI, which concentrated purely on improving reasoning capabilities, HLE requires
combining reasoning with deep knowledge extraction from the model. Additionally, when used with
tools, this test suite also measures how well the model can find and synthesize new knowledge.
HLE defines two tracks – one that uses only the knowledge in the LLMs without any tools or
access to the web, and another that allows an LLM access to tools, including the ability to
execute code and perform web searches.
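As a rough illustration of the difference between the two tracks (the configuration below is hypothetical, not HLE's or Poetiq's actual setup), they can be viewed as the same evaluation run with different tool sets:

```python
# Hypothetical sketch: the two HLE tracks viewed as the same evaluation
# with different tool sets. Names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class TrackConfig:
    name: str
    tools: list[str] = field(default_factory=list)  # tools the model may call

# Closed-book track: only the knowledge in the LLM's weights, no external help.
NO_TOOLS = TrackConfig(name="hle-no-tools")

# Tool track: the model may also execute code and search the web.
WITH_TOOLS = TrackConfig(name="hle-with-tools",
                         tools=["code_execution", "web_search"])
```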
In Figure 1, we examine the performance of models with tools. Through the use of Poetiq’s harness, we are able to surpass all state-of-the-art methods, including others, such as Zoom AI’s, that also use multiple models. To highlight the flexibility of Poetiq’s harness, we note that the most recently released model (Claude Opus 4.6, released on February 5, 2026) was integrated into our harness within 24 hours of release and yielded further improvements to our SOTA results. By combining the results from all three of the top model families, our system was able to achieve 55.0% (~2% over the previous SOTA).
There were many other combinations of models that achieved SOTA results as well. Two are shown: ablating Claude Opus 4.6 leaves Poetiq (Gemini 3 Pro & Flash, GPT 5.2 High) at 53.8%; further ablating GPT 5.2 High leaves Poetiq (Gemini 3 Pro & Flash) at 53.7%.
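As a rough sketch of what mixing model families can look like (the code below is an illustration under our own assumptions, not a description of the harness internals), one simple combination strategy queries several models independently and keeps the answer they most often agree on:

```python
# Hypothetical sketch of combining answers from multiple model families.
# `query` is a placeholder for a real API call; this is not Poetiq's algorithm.
from collections import Counter

def query(model: str, question: str) -> str:
    """Placeholder: send `question` to the named model and return its answer."""
    raise NotImplementedError

def mixture_answer(models: list[str], question: str) -> str:
    # Ask every model independently, then keep the most common answer
    # (a simple plurality vote across model families).
    answers = [query(m, question) for m in models]
    return Counter(answers).most_common(1)[0][0]

# e.g., mixture_answer(["gemini-3-pro", "gpt-5.2-high", "claude-opus-4.6"], q)
```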
Performance on HLE is often measured without tool use as well. This means no web access and no Python code execution. This is a pure measure of the knowledge encoded within the LLM’s weights and the ability to retrieve and reason with that knowledge. Here, Claude Opus 4.6 was the leader with an accuracy of 40.0%. We applied our harness to the second-best model (Gemini 3 Pro) and overtook the leader – see Figure 2. This demonstrates the ability of our harness to extract more accurate knowledge from the underlying LLM and to reason more effectively.
Figure 2: The Poetiq harness substantially improves the performance of Gemini 3 Pro, the second best model, to surpass the recently released top-performing model, Claude Opus 4.6.
Throughout this blog, all non-Poetiq results are taken from model cards, official announcements,
or scale.ai evaluations.
In addition to Poetiq (Gemini 3 Pro), we also used Poetiq with the weakest model on the graph above, Gemini 3 Flash (not shown). This again yielded large gains, lifting Gemini 3 Flash’s performance to 38.7% (+5.0%). With our harness, Gemini 3 Flash overtakes more powerful models such as Gemini 3 Pro, GPT 5.2 X-High, and GPT 5.2 Pro. This exemplifies one of Poetiq’s core strengths: with our harness, older, cheaper, and smaller models can outperform more recent, larger, or more expensive ones.
As noted above, a core tenet of Poetiq is that our systems can orchestrate any model; we are not tied to a specific model or model family. To further illustrate the benefits of Poetiq’s meta-system, we verified our harness on models from Google, OpenAI, and Anthropic. To reduce confounding factors and focus on the model’s knowledge and reasoning capabilities, these runs were conducted on pure text problems (no images) and without access to tools (i.e., no code execution or web access). A representative sample of the results is shown in Figure 3. The Poetiq harness achieves a lift over every model tested, demonstrating the generality of our approach.
Figure 3: The Poetiq harness improves every model tested.
1 We avoid potential confounding factors in these experiments by working in the text-only, no-tools setting. *Note that we cannot compare to Claude Opus 4.6, as text-only results have not been reported for it.
SimpleQA Verified
SimpleQA Verified, released in September 2025 by Google, is described as “A factuality benchmark … that measures the ability for language models to answer short, fact-seeking questions”. Like HLE, this benchmark examines how much knowledge the frontier model has encoded, but with a specific focus on the LLM’s parametric knowledge and its ability to avoid hallucinations on short-form, indisputable facts (e.g., specific dates or names). The questions usually have a single, objective answer, making them easy to grade. However, many frontier models currently struggle to answer them correctly.
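As a toy illustration of why such questions are easy to grade (SimpleQA Verified uses its own grading protocol; the snippet below is only a simplification), a normalized string comparison against the single gold answer already captures the idea:

```python
# Toy illustration only: single-answer factual questions can be scored by
# normalizing and comparing strings. The real benchmark's grader is stricter.
import re

def normalize(text: str) -> str:
    # Lowercase and drop punctuation and surrounding whitespace before comparing.
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def is_correct(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

assert is_correct("  Marie Curie. ", "marie curie")
```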
As on HLE, we show in Figure 4 that the Poetiq harness can improve the performance of every
model tested. As before, using a combination of models within Poetiq’s harness yields the
best performance. Since SimpleQA Verified is designed for testing without tools or internet access, we used the same harness that we used to obtain the HLE without-tools results.
Figure 4: The same Poetiq harness used for HLE without tools improves every model tested
on SimpleQA Verified. Additionally, the Poetiq mixture yields a new SOTA of 77.3% (+5.2% improvement
over the previous SOTA).
About the Results
Like our results on
ARC-AGI, it’s LLMs all the way
down. Poetiq’s meta-system uses LLMs to build, improve, and power these task-specific
generated systems. The flexible, powerful, and recursive architecture of the meta-system is
what allowed our small team to rapidly achieve this suite of state-of-the-art results. The
outputs of our meta-system revealed a number of surprising findings:
- LLMs solve many of these problems inconsistently. Our verification and testing techniques allowed us to overcome these inconsistencies in many cases (a toy sample-then-verify sketch follows this list).
- When we allowed the use of tools, our system selected if and when to write code. The results were not always intuitive – the harness correctly solved many problems where coding was not obviously beneficial by writing code, and for other problems that appeared algorithmic in nature, it extracted the correct answer directly from the model.
- We again saw that LLMs possess an enormous amount of untapped information, allowing them to address many more tasks than public results and published model cards suggested. Poetiq’s harness surfaces that information.
- Unlike with ARC-AGI, the Poetiq harness was hierarchical in nature, segmenting the problems into automatically selected buckets.
- Employing multiple models and model families can substantially improve performance on this task. We hypothesize that this effect may be pronounced in tasks, such as this one, where diverse knowledge retrieval is critical.
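To make the first finding concrete, here is a minimal sample-then-verify sketch. The `ask` and `verify` helpers are hypothetical stand-ins for LLM calls, and the loop illustrates the general idea rather than Poetiq’s actual verification technique:

```python
# Minimal sample-then-verify sketch. `ask` and `verify` are hypothetical
# stand-ins for LLM calls; this is not Poetiq's verification technique.
def ask(question: str) -> str:
    """Placeholder: sample one candidate answer from an LLM."""
    raise NotImplementedError

def verify(question: str, answer: str) -> bool:
    """Placeholder: have a (possibly different) LLM check the candidate."""
    raise NotImplementedError

def answer_with_verification(question: str, budget: int = 5) -> str | None:
    # Models answer inconsistently, so sample repeatedly and return the first
    # candidate that survives an independent verification pass.
    for _ in range(budget):
        candidate = ask(question)
        if verify(question, candidate):
            return candidate
    return None  # no verified answer within the budget
```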
LLMs hold much of humanity’s knowledge – they are absolutely amazing databases. And they are
constantly improving. Enormous effort, expense, and resources are being expended by frontier
model creators to ensure that ever more information is encoded into their models’ weights.
However, there is a fundamental imbalance between encoding more information and being able
to retrieve it effectively for solving real problems. For these problems, the challenge lies
not only in discovering an improved reasoning strategy (as was the case with ARC-AGI), but
also in probing the models intelligently to find the numerous pieces of information that each
problem requires.
There is a fundamental imbalance between encoding more information in LLMs and being able to
retrieve it effectively.
At Poetiq, we are building technology to automate and optimize the extraction of this
fragmented knowledge for complex tasks. We don’t dictate rules a priori, but
instead discover the appropriate questions to ask. Every new piece of information dictates
what’s asked next.
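A minimal sketch of this kind of adaptive extraction loop might look as follows; the helper functions are hypothetical stand-ins for model calls, not the meta-system itself:

```python
# Hypothetical sketch of adaptive knowledge extraction: each follow-up
# question is conditioned on everything learned so far.
def next_question(task: str, facts: list[str]) -> str | None:
    """Placeholder: ask an LLM what to ask next given the facts gathered so far,
    or return None once enough is known."""
    raise NotImplementedError

def answer(question: str) -> str:
    """Placeholder: query the underlying model for one piece of information."""
    raise NotImplementedError

def extract_knowledge(task: str, max_steps: int = 10) -> list[str]:
    facts: list[str] = []
    for _ in range(max_steps):
        question = next_question(task, facts)
        if question is None:            # the questioner decides it knows enough
            break
        facts.append(answer(question))  # each new fact shapes the next question
    return facts
```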
So, What's Next?
At Poetiq, our core meta-system is engineered to automate the extraction of knowledge for
challenging tasks, producing highly optimized agents, harnesses, and orchestrators. This
capability extends across a wide spectrum of tasks, from solving complex reasoning problems,
like in ARC-AGI, to assembling diverse, scattered knowledge found deep within an LLM's
weights, like in HLE and SimpleQA Verified.
We optimize every part of the process: developing better strategies for determining what to
ask, refining sequential chain-of-questions, and devising fundamental new methods for
assembling the answers.
Since our emergence from stealth in November 2025, we've publicly demonstrated our approach
on benchmarks requiring reasoning, knowledge retrieval, and advanced tool use. Each exposed
unique hurdles in maximizing LLM performance. So what’s next? We’ll have more demonstrations
of the Poetiq meta-system’s capabilities soon – watch this space! We are also engaging with
early customers – if you’re excited to be one of the first to try Poetiq on your problems,
be sure to request early access!
Join the Journey
Poetiq is a lean, deeply technical team of 7 researchers and engineers with a combined 72
years of experience from Google/DeepMind. We're focused on solving the fundamental problems
of AI reasoning and knowledge extraction in the presence of noise and uncertainty. Want to
join us?
Check out our open positions.
For all models other than Gemini 3 Pro, we used only minimal versions
of our harness to save on compute costs (we’re still a startup!). We expect that using full
versions of our harness will result in further performance improvements.