Figure 1: The Poetiq harness achieves SOTA performance on HLE. The harness integrates models
from multiple model families, yielding multiple paths to new SOTA results. Below, we’ll show
similar improvements with single models.
One of the core tenets of Poetiq is that our systems can be deployed to orchestrate any model; we are not tied to a specific model or model family. We are happy to report improved performance across all models tested, and further gains from mixing models. As with our previous results on ARC-AGI, we have taken care that all baselines reflect the reported state of the art for every model we compare against, so the comparisons are fair and unbiased.
Humanity's Last Exam
Humanity's Last Exam (HLE) is
a comprehensive, multi-modal academic benchmark positioned at the forefront of human knowledge,
intended to be the conclusive closed-ended benchmark of its kind.
The New York Times commented: “When A.I. Passes This Test, Look Out”. We’re getting closer. Unlike our previous
results with ARC-AGI, which concentrated purely on improving reasoning capabilities, HLE requires
combining reasoning with deep knowledge extraction from the model. Additionally, when used with
tools, this test suite also measures how well the model can find and synthesize new knowledge.
HLE defines two tracks – one that uses only the knowledge in the LLMs without any tools or
access to the web, and another that allows an LLM access to tools, including the ability to
execute code and perform web searches.
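As a rough illustration of the difference between the two tracks (the configuration below is hypothetical, not HLE's or Poetiq's actual setup), they can be viewed as the same evaluation run with different tool sets:

```python
# Hypothetical sketch: the two HLE tracks viewed as the same evaluation
# with different tool sets. Names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class TrackConfig:
    name: str
    tools: list[str] = field(default_factory=list)  # tools the model may call

# Closed-book track: only the knowledge in the LLM's weights, no external help.
NO_TOOLS = TrackConfig(name="hle-no-tools")

# Tool track: the model may also execute code and search the web.
WITH_TOOLS = TrackConfig(name="hle-with-tools",
                         tools=["code_execution", "web_search"])
```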
In Figure 1, we examine the performance of models with tools. Through the use of Poetiq’s harness, we are able to surpass all state-of-the-art methods, including others, such as Zoom AI’s, that also use multiple models. To highlight the flexibility of Poetiq’s harness, we note that the most recently released model (Claude Opus 4.6, released on February 5, 2026) was integrated into our harness within 24 hours of release and yielded further improvements to our SOTA results. By combining the results from all three of the top model families, our system was able to achieve 55.0% (~2% over the previous SOTA).
There were many other combinations of models that achieved SOTA results as well. Two are shown: ablating Claude Opus 4.6 leaves Poetiq (Gemini 3 Pro & Flash, GPT 5.2 High) at 53.8%; further ablating GPT 5.2 High leaves Poetiq (Gemini 3 Pro & Flash) at 53.7%.
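As a rough sketch of what mixing model families can look like (the code below is an illustration under our own assumptions, not a description of the harness internals), one simple combination strategy queries several models independently and keeps the answer they most often agree on:

```python
# Hypothetical sketch of combining answers from multiple model families.
# `query` is a placeholder for a real API call; this is not Poetiq's algorithm.
from collections import Counter

def query(model: str, question: str) -> str:
    """Placeholder: send `question` to the named model and return its answer."""
    raise NotImplementedError

def mixture_answer(models: list[str], question: str) -> str:
    # Ask every model independently, then keep the most common answer
    # (a simple plurality vote across model families).
    answers = [query(m, question) for m in models]
    return Counter(answers).most_common(1)[0][0]

# e.g., mixture_answer(["gemini-3-pro", "gpt-5.2-high", "claude-opus-4.6"], q)
```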
Performance on HLE is often measured without tool use as well. This means no web access and no Python code execution. This is a pure measure of the knowledge encoded within the LLM’s weights and the ability to retrieve and reason with that knowledge. Here, Claude Opus 4.6 was the leader with an accuracy of 40.0%. We applied our harness to the second-best model (Gemini 3 Pro) and overtook the leader – see Figure 2. This demonstrates the ability of our harness to extract more accurate knowledge from the underlying LLM and to reason more effectively.
Figure 2: The Poetiq harness substantially improves the performance of Gemini 3 Pro, the second best model, to surpass the recently released top-performing model, Claude Opus 4.6.
Throughout this blog, all non-Poetiq results are taken from model cards, official announcements,
or scale.ai evaluations.
In addition to Poetiq (Gemini 3 Pro), we also used Poetiq with the weakest model on the graph above, Gemini 3 Flash (not shown). This again yielded large gains, lifting Gemini 3 Flash’s performance to 38.7% (+5.0%). With our harness, Gemini 3 Flash overtakes more powerful models such as Gemini 3 Pro, GPT 5.2 X-High, and GPT 5.2 Pro. This exemplifies one of Poetiq’s core strengths: with our harness, older, cheaper, and smaller models can outperform more recent, larger, or more expensive ones.
As noted above, a core tenet of Poetiq is that our systems can orchestrate any model; we are not tied to a specific model or model family. To further illustrate the benefits of Poetiq’s meta-system, we verified our harness on models from Google, OpenAI, and Anthropic. To reduce confounding factors and focus on the model’s knowledge and reasoning capabilities, these runs were conducted on pure text problems (no images) and without access to tools (i.e., no code execution or web access). A representative sample of the results is shown in Figure 3. The Poetiq harness achieves a lift over every model tested, demonstrating the generality of our approach.
Figure 3: The Poetiq harness improves every model tested.
1 We avoid potential confounding factors in these experiments by working in the text-only, no-tools setting. *Note that we cannot compare to Claude Opus 4.6, as text-only results have not been reported for it.
SimpleQA Verified
SimpleQA Verified, released in September 2025 by Google, is described as “A factuality benchmark … that measures the ability for language models to answer short, fact-seeking questions”. Like HLE, this benchmark examines how much knowledge the frontier model has encoded, but with a specific focus on the LLM’s parametric knowledge and its ability to avoid hallucinations on short-form, indisputable facts (e.g., specific dates or names). The questions usually have a single, objective answer, making them easy to grade. However, many frontier models currently struggle to answer them correctly.
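As a toy illustration of why such questions are easy to grade (SimpleQA Verified uses its own grading protocol; the snippet below is only a simplification), a normalized string comparison against the single gold answer already captures the idea:

```python
# Toy illustration only: single-answer factual questions can be scored by
# normalizing and comparing strings. The real benchmark's grader is stricter.
import re

def normalize(text: str) -> str:
    # Lowercase and drop punctuation and surrounding whitespace before comparing.
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def is_correct(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

assert is_correct("  Marie Curie. ", "marie curie")
```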
As on HLE, we show in Figure 4 that the Poetiq harness can improve the performance of every
model tested. As before, using a combination of models within Poetiq’s harness yields the
best performance. Since SimpleQA Verified is designed for testing without tools or internet access, we used the same harness that we used to obtain the HLE without-tools results.
Figure 4: The same Poetiq harness used for HLE without tools improves every model tested
on SimpleQA Verified. Additionally, the Poetiq mixture yields a new SOTA of 77.3% (+5.2% improvement
over the previous SOTA).
About the Results
Like our results on
ARC-AGI, it’s LLMs all the way
down. Poetiq’s meta-system uses LLMs to build, improve, and power these task-specific
generated systems. The flexible, powerful, and recursive architecture of the meta-system is
what allowed our small team to rapidly achieve this suite of state-of-the-art results. The
outputs of our meta-system revealed a number of surprising findings:
- LLMs solve many of these problems inconsistently. Our verification and testing techniques allowed us to overcome these inconsistencies in many cases (a toy sample-then-verify sketch follows this list).
- When we allowed the use of tools, our system selected if and when to write code. The results were not always intuitive – the harness correctly solved many problems where coding was not obviously beneficial by writing code, and for other problems that appeared algorithmic in nature, it extracted the correct answer directly from the model.
- We again saw that LLMs possess an enormous amount of untapped information, allowing them to address many more tasks than public results and published model cards suggested. Poetiq’s harness surfaces that information.
- Unlike with ARC-AGI, the Poetiq harness was hierarchical in nature, segmenting the problems into automatically selected buckets.
- Employing multiple models and model families can substantially improve performance on this task. We hypothesize that this effect may be pronounced in tasks, such as this one, where diverse knowledge retrieval is critical.
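To make the first finding concrete, here is a minimal sample-then-verify sketch. The `ask` and `verify` helpers are hypothetical stand-ins for LLM calls, and the loop illustrates the general idea rather than Poetiq’s actual verification technique:

```python
# Minimal sample-then-verify sketch. `ask` and `verify` are hypothetical
# stand-ins for LLM calls; this is not Poetiq's verification technique.
def ask(question: str) -> str:
    """Placeholder: sample one candidate answer from an LLM."""
    raise NotImplementedError

def verify(question: str, answer: str) -> bool:
    """Placeholder: have a (possibly different) LLM check the candidate."""
    raise NotImplementedError

def answer_with_verification(question: str, budget: int = 5) -> str | None:
    # Models answer inconsistently, so sample repeatedly and return the first
    # candidate that survives an independent verification pass.
    for _ in range(budget):
        candidate = ask(question)
        if verify(question, candidate):
            return candidate
    return None  # no verified answer within the budget
```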
LLMs hold much of humanity’s knowledge – they are absolutely amazing databases. And they are
constantly improving. Enormous effort, expense, and resources are being expended by frontier
model creators to ensure that ever more information is encoded into their models’ weights.
However, there is a fundamental imbalance between encoding more information and being able
to retrieve it effectively for solving real problems. For these problems, the challenge lies
not only in discovering an improved reasoning strategy (as was the case with ARC-AGI), but
also in probing the models intelligently to find the numerous pieces of information that each
problem requires.
There is a fundamental imbalance between encoding more information in LLMs and being able to
retrieve it effectively.
At Poetiq, we are building technology to automate and optimize the extraction of this
fragmented knowledge for complex tasks. We don’t dictate rules a priori, but
instead discover the appropriate questions to ask. Every new piece of information dictates
what’s asked next.
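A minimal sketch of this kind of adaptive extraction loop might look as follows; the helper functions are hypothetical stand-ins for model calls, not the meta-system itself:

```python
# Hypothetical sketch of adaptive knowledge extraction: each follow-up
# question is conditioned on everything learned so far.
def next_question(task: str, facts: list[str]) -> str | None:
    """Placeholder: ask an LLM what to ask next given the facts gathered so far,
    or return None once enough is known."""
    raise NotImplementedError

def answer(question: str) -> str:
    """Placeholder: query the underlying model for one piece of information."""
    raise NotImplementedError

def extract_knowledge(task: str, max_steps: int = 10) -> list[str]:
    facts: list[str] = []
    for _ in range(max_steps):
        question = next_question(task, facts)
        if question is None:            # the questioner decides it knows enough
            break
        facts.append(answer(question))  # each new fact shapes the next question
    return facts
```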
So, What's Next?
At Poetiq, our core meta-system is engineered to automate the extraction of knowledge for
challenging tasks, producing highly optimized agents, harnesses, and orchestrators. This
capability extends across a wide spectrum of tasks, from solving complex reasoning problems,
like in ARC-AGI, to assembling diverse, scattered knowledge found deep within an LLM's
weights, like in HLE and SimpleQA Verified.
We optimize every part of the process: developing better strategies for determining what to
ask, refining sequential chain-of-questions, and devising fundamental new methods for
assembling the answers.
Since our emergence from stealth in November 2025, we've publicly demonstrated our approach
on benchmarks requiring reasoning, knowledge retrieval, and advanced tool use. Each exposed
unique hurdles in maximizing LLM performance. So what’s next? We’ll have more demonstrations
of the Poetiq meta-system’s capabilities soon – watch this space! We are also engaging with
early customers – if you’re excited to be one of the first to try Poetiq on your problems,
be sure to request early access!
Join the Journey
Poetiq is a lean, deeply technical team of 7 researchers and engineers with a combined 72
years of experience from Google/DeepMind. We're focused on solving the fundamental problems
of AI reasoning and knowledge extraction in the presence of noise and uncertainty. Want to
join us?
Check out our open positions.
For all models other than Gemini 3 Pro, we used only minimal versions
of our harness to save on compute costs (we’re still a startup!). We expect that using full
versions of our harness will result in further performance improvements.