We ran Poetiq's Meta-System on a coding benchmark, let it construct and optimize its own
harnesses from scratch, and saw improvements across all models tested,
open-weights and proprietary. No fine-tuning, no special access, no hand-built pipelines.
Automatically Improving Beyond SOTA on LiveCodeBench Pro
LiveCodeBench Pro (LCB Pro) is an authoritative coding benchmark; to be judged successful,
a solution must not only produce the correct answer but also satisfy strict memory
and runtime constraints. Importantly, the benchmark is explicitly designed to mitigate
LLM data contamination: its testing suite is continuously updated,
distinguishing it from many standard benchmarks. It further avoids overfitting by
withholding public ground-truth code; instead, it uses a comprehensive testing
framework that validates generated solutions against required outputs. The problems in
the benchmark come from major coding competitions.
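LCB Pro's grading machinery is not public, but the accept/reject logic it implies is easy to picture. Below is a minimal, hypothetical judge sketch in Python; `run_case`, `judge`, and the specific limits are our own illustrative choices, not the benchmark's actual implementation.

```python
import resource
import subprocess

def run_case(binary, stdin_text, time_limit_s=2.0, mem_limit_mb=256):
    """Run a compiled solution on one test case under time/memory limits."""
    def set_limits():  # runs in the child process (POSIX only)
        cap = mem_limit_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (cap, cap))

    try:
        proc = subprocess.run(
            [binary], input=stdin_text, capture_output=True, text=True,
            timeout=time_limit_s, preexec_fn=set_limits,
        )
    except subprocess.TimeoutExpired:
        return "TLE", ""
    if proc.returncode != 0:
        return "RTE", ""  # crashes and memory-limit kills land here
    return "OK", proc.stdout

def judge(binary, cases):
    """Accept only if every test case matches the expected output."""
    for stdin_text, expected in cases:
        status, out = run_case(binary, stdin_text)
        if status != "OK" or out.strip() != expected.strip():
            return False
    return True
```

A solution is rejected if it is merely slow or memory-hungry, even when its answers are correct, which is what makes the benchmark a test of performant code rather than just correct code.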
LCB Pro emphasizes creative coding through difficult C++ challenges that
effectively test an AI's capacity for complex problem-solving. This distinguishes it from
datasets like SWE-bench, which evaluate tool usage and bug-fixing workflows. Ultimately, the
benchmark provides a pure assessment of a model's inherent programming ability and
its capacity to generate high-quality, performant procedural logic.
We ran Poetiq's Meta-System on LCB Pro, letting it create a custom LCB harness from
scratch, optimized for Gemini 3.1 Pro. The harness was then tested on many other models
from different providers and different generations, both open-weights and proprietary.
The new state-of-the-art (SOTA) results are shown in Figure 1.
Figure 1: Our automatically created harness achieves a new state-of-the-art (SOTA) on
LiveCodeBench Pro (25Q2), outperforming GPT 5.5 High by 4.3 percentage points using the
same base model. Furthermore, applying Poetiq's harness to Google's Gemini 3.1 Pro improves
its performance by 12.3 percentage points. Additionally, similar to our previous success
on ARC-AGI, this optimization allows the smaller, more economical Gemini 3.1 Pro to surpass
Google's own flagship system, Gemini Deep Think.
Building upon insights it learned from previous benchmarks (ARC-AGI, HLE), our
Meta-System optimized every part of the harness to improve its performance. We optimized
the harness using only the Gemini 3.1 Pro model; the Meta-System accounted for accuracy,
runtime, and memory constraints when designing the harness. The optimized harness
improved Gemini 3.1 Pro's results by 12.3 percentage points (78.6% to 90.9%), overtaking
GPT 5.5, the strongest base model we tested on this benchmark. Moreover, when we applied
the same harness, without any new optimization, to GPT 5.5 itself, it improved its
accuracy to 93.9%, surpassing its own previous best results and pushing the SOTA boundary
even further.
Lastly, though Google's own highest-performing model, Gemini Deep Think, is not
accessible via API for verification, we surpassed its reported performance as well. It is
important to note that our performance was reached without any fine-tuning of the
underlying model and without special access to any model's internal activations. Our
Meta-System creates an intelligent harness, through recursive self-improvement, that
requires only standard API access.
Unlike post-training and fine-tuning, in which every improvement is tied to a specific
model, we can apply our learned harness to any LLM. As noted above, our harness
was optimized for Gemini 3.1 Pro, but it also significantly improved GPT 5.5 when applied
to it. Another interesting example is Gemini 3.0 Flash: the harness improved its accuracy
by 10 percentage points, from 72.3% to 82.3%. This surpassed Gemini 3.1 Pro,
Anthropic's Claude Opus 4.7, and OpenAI's GPT 5.2 High, all much bigger and more expensive
models than Gemini 3.0 Flash.
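A harness that needs only standard API access is, in code terms, a harness that consumes nothing but a text-in/text-out function. The sketch below illustrates why that makes it portable across providers; all names here are our own illustration, not Poetiq's actual interfaces.

```python
from typing import Callable, Optional

# A "model" is reduced to a prompt -> completion function; wrapping any
# provider's chat endpoint in this signature takes a few lines of SDK code.
Model = Callable[[str], str]

def make_harness(prompt_template: str, n_attempts: int, judge) -> Callable:
    """Build the harness once; run it unchanged against any model."""
    def harness(model: Model, problem: str) -> Optional[str]:
        for _ in range(n_attempts):
            code = model(prompt_template.format(problem=problem))
            if judge(code):  # keep the first candidate that passes the tests
                return code
        return None
    return harness
```

Swapping one base model for another then changes only the callable passed in, which is why the same harness can transfer from Gemini 3.1 Pro to GPT 5.5 without re-optimization.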
Because LCB Pro categorizes challenges by difficulty (Easy, Medium, and Hard) based on
competitive human solve rates, it provides a granular view of capability. Table 1 shows
that our optimized harness consistently outperforms the corresponding base models across
every category.
| Model | Overall Accuracy | Hard | Medium | Easy |
| --- | --- | --- | --- | --- |
| Gemini | | | | |
| Gemini 3.1 Pro | 78.6% | 7.7% | 64.9% | 94.8% |
| Gemini 3 Deep Think | 88.8% | 53.8% | 86.0% | 94.8% |
| Poetiq Harness w/ Gemini 3.1 Pro | 90.9% | 58.3% | 87.5% | 96.9% |
| GPT | | | | |
| GPT 5.5 High | 89.6% | 50.0% | 91.1% | 93.8% |
| Poetiq Harness w/ GPT 5.5 High | 93.9% | 75.0% | 92.9% | 96.9% |
Table 1: Poetiq Harness vs. Gemini and GPT models on LCB Pro (25Q2). Reported accuracies
are split according to the benchmark's difficulty categories. The Poetiq Harness improves
on its base model in every category, and with GPT 5.5 High it outperforms all other
models in every category.
To further illustrate the benefits of Poetiq's Meta-System, we apply our technique to
popular recent models, both closed-source and open-weights. The Leap Frog graphs
below show large jumps in performance for two models (Gemini 3.0 Flash and Kimi K2.6); a
summary of the improvements on all models can be found in the Appendix.
Figure 2: Using Poetiq's harness improves all models tested. Here we highlight two
examples: (A) Gemini 3.0 Flash, (B) Kimi K2.6. Note Kimi K2.6's 30% improvement. See the
Appendix for all model improvements.
Why Test on Code?
This is Poetiq's third publicly reported benchmark test. Previously, we showed how we
can improve all models' performance on both ARC-AGI and Humanity's Last Exam (HLE). We
have been strategic in the benchmarks we have attempted; we believe there are
three critical categories of tasks for LLMs:
- Reasoning challenges: these require the LLM to synthesize provided
information in inventive ways; ARC-AGI stands as the premier example of this
ability.
- Retrieval challenges: these quantify the breadth of knowledge embedded
within a model's weights. HLE serves as a rigorous audit of this, requiring models
to recall precise facts across a vast spectrum of disciplines.
- Coding challenges: as the most pervasive commercial application for AI
today, these tasks meld reasoning and retrieval with the generation of specialized
procedural logic. Achieving state-of-the-art results here demonstrates the potential
economic impact of our approach to recursive self-improvement.
Our coding initiatives focused on three primary objectives:
- To prove that by constructing an intelligent harness around any underlying LLM, we
can boost efficacy without fine-tuning or special model access. ✓
- To validate our Meta-System's capacity for recursive self-improvement in creating
this harness. We are proud to state that our system builds and optimizes these
task-specific harnesses fully automatically. ✓
- To demonstrate that once our harness is built, it is model-agnostic and can
be used, without modification, with any model. ✓
SOTA results were achieved using harnesses that were automatically created and optimized
by Poetiq's Meta-System, which in turn refines itself through recursive self-improvement.
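Poetiq has not published the Meta-System's internals, so as a deliberately toy illustration of "automatically created and optimized," here is a greedy hill-climb over harness configurations; `mutate`, `evaluate`, and the config fields are all hypothetical stand-ins.

```python
import random

def meta_optimize(seed_config, mutate, evaluate, rounds=50):
    """Toy self-improvement loop: propose a harness variant, score it on a
    dev set (e.g., accuracy minus runtime/memory penalties), keep if better."""
    best, best_score = seed_config, evaluate(seed_config)
    for _ in range(rounds):
        candidate = mutate(best)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

# Hypothetical usage: tune how many candidate programs to sample per problem.
tuned = meta_optimize(
    seed_config={"n_attempts": 1},
    mutate=lambda c: {"n_attempts": max(1, c["n_attempts"] + random.choice((-1, 1)))},
    evaluate=lambda c: min(c["n_attempts"], 4) - 0.2 * c["n_attempts"],  # toy score
)
```

The toy score rewards extra attempts up to a point and then penalizes them, standing in for the real trade-off between accuracy and compute that any such loop must balance.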
So, What's Next?
At Poetiq, our core Meta-System is engineered to automate the extraction of knowledge
for challenging tasks, producing highly optimized agents, harnesses, and orchestrators.
We optimize every part of the process: developing better strategies for determining what
to ask, refining sequential chains of questions, and devising fundamentally new methods
for assembling the answers. Our Meta-System constantly incorporates its learnings from
previous and current tasks and datasets to automatically create new, custom,
task-specific harnesses.
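As one hypothetical instance of a "sequential chain of questions," an assembled answer might come from an ask/test/re-ask loop like the following; the prompts and control flow are our own sketch, not the Meta-System's actual strategy.

```python
def solve_with_chain(model, problem, judge, max_rounds=3):
    """Ask for a solution, test it, and re-ask with feedback until accepted."""
    code = model(f"Write a C++ solution to this problem:\n{problem}")
    for _ in range(max_rounds):
        if judge(code):
            return code
        # Turn the failure into a follow-up question rather than retrying blind.
        critique = model(f"This C++ solution fails hidden tests. List likely bugs:\n{code}")
        code = model(
            f"Problem:\n{problem}\n\nFailing solution:\n{code}\n\n"
            f"Critique:\n{critique}\n\nWrite a corrected C++ solution."
        )
    return code  # best effort after max_rounds
```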
Since our emergence from stealth in November 2025, we have publicly demonstrated our
approach on benchmarks requiring reasoning, knowledge retrieval, advanced tool use
(e.g., in HLE), and now coding. Each exposed unique hurdles in maximizing LLM
performance. So what's next? We are engaging with a small, select group of early
customers; if you're excited to be one of the first to try Poetiq on your problems, let
us know.
Join the Journey
Poetiq is a lean, deeply technical team with a combined 72 years of experience from
Google/DeepMind. We're focused on solving the fundamental problems of AI reasoning and
knowledge extraction in the presence of noise and uncertainty through recursive
self-improvement. Want to join us?
Check out our open positions.
Appendix
Figure 3: All model improvements on LCB Pro. Notice Kimi's 30% improvement and
Nemotron's 12.8% improvement!
Whenever possible, we report accuracy numbers directly from the LCB Pro leaderboard at
https://livecodebenchpro.com/projects/livecodebench-pro/leaderboard (25Q2). For models
not featured on the leaderboard, we conducted our own evaluations. To validate our
experimental setup, we tested several baseline models and successfully replicated their
official leaderboard accuracies.