We ran Poetiq's Meta-System on a coding benchmark, let it construct and optimize its own
harnesses from scratch, and saw improvements across all models tested,
open-weights and proprietary. No fine-tuning, no special access, no hand-built pipelines.
Automatically Improving Beyond SOTA on LiveCodeBench Pro
LiveCodeBench Pro (LCB Pro) is an authoritative coding benchmark; to be judged successful,
a solution must not only produce the correct answer but also satisfy strict memory
and runtime constraints. Importantly, the benchmark is explicitly designed to mitigate
LLM data contamination: its testing suite is continuously updated,
distinguishing it from many standard benchmarks. It further avoids overfitting by
withholding public ground-truth code; instead, it uses a comprehensive testing
framework that validates generated solutions against required outputs. The problems in
the benchmark come from major coding competitions.
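LCB Pro's grading machinery is not public, but the accept/reject logic it implies is easy to picture. Below is a minimal, hypothetical judge sketch in Python; `run_case`, `judge`, and the specific limits are our own illustrative choices, not the benchmark's actual implementation.

```python
import resource
import subprocess

def run_case(binary, stdin_text, time_limit_s=2.0, mem_limit_mb=256):
    """Run a compiled solution on one test case under time/memory limits."""
    def set_limits():  # runs in the child process (POSIX only)
        cap = mem_limit_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (cap, cap))

    try:
        proc = subprocess.run(
            [binary], input=stdin_text, capture_output=True, text=True,
            timeout=time_limit_s, preexec_fn=set_limits,
        )
    except subprocess.TimeoutExpired:
        return "TLE", ""
    if proc.returncode != 0:
        return "RTE", ""  # crashes and memory-limit kills land here
    return "OK", proc.stdout

def judge(binary, cases):
    """Accept only if every test case matches the expected output."""
    for stdin_text, expected in cases:
        status, out = run_case(binary, stdin_text)
        if status != "OK" or out.strip() != expected.strip():
            return False
    return True
```

A solution is rejected if it is merely slow or memory-hungry, even when its answers are correct, which is what makes the benchmark a test of performant code rather than just correct code.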
LCB Pro emphasizes creative coding through difficult C++ challenges that
effectively test an AI's capacity for complex problem-solving. This distinguishes it from
datasets like SWE-bench, which evaluate tool usage and bug-fixing workflows. Ultimately, the
benchmark provides a pure assessment of a model's inherent programming ability and
its capacity to generate high-quality, performant procedural logic.
We ran Poetiq's Meta-System on LCB Pro, letting it create a custom LCB harness from
scratch, optimized for Gemini 3.1 Pro. The harness was then tested on many other models
from different providers and different generations, both open-weights and proprietary.
The new state-of-the-art (SOTA) results are shown in Figure 1.
Figure 1: Our automatically created harness achieves a new state-of-the-art (SOTA) on
LiveCodeBench Pro (25Q2), outperforming GPT 5.5 High by 4.3 percentage points using the
same base model. Furthermore, applying Poetiq's harness to Google's Gemini 3.1 Pro improves
its performance by 12.3 percentage points. Additionally, similar to our previous success
on ARC-AGI, this optimization allows the smaller, more economical Gemini 3.1 Pro to surpass
Google's own flagship system, Gemini Deep Think.
Building upon insights it learned from previous benchmarks (ARC-AGI, HLE), our
Meta-System optimized every part of the harness to improve its performance. We optimized
the harness using only the Gemini 3.1 Pro model; the Meta-System accounted for accuracy,
runtime, and memory constraints when designing the harness. The optimized harness
improved Gemini 3.1 Pro's results by 12.3 percentage points (78.6% to 90.9%), overtaking
GPT 5.5, the strongest base model we tested on this benchmark. Moreover, when we applied
the same harness, without any new optimization, to GPT 5.5 itself, it improved its
accuracy to 93.9%, surpassing its own previous best results and pushing the SOTA boundary
even further.
Lastly, though Google's own highest-performing model, Gemini Deep Think, is not
accessible via API for verification, we surpassed its reported performance as well. It is
important to note that our performance was reached without any fine-tuning of the
underlying model and without special access to any model's internal activations. Our
Meta-System creates an intelligent harness, through recursive self-improvement, that
requires only standard API access.
Unlike post-training and fine-tuning, in which every improvement is tied to a specific
model, we can apply our learned harness to any LLM. As noted above, our harness
was optimized for Gemini 3.1 Pro, but it also significantly improved GPT 5.5 when applied
to it. Another interesting example is Gemini 3.0 Flash: the harness improved its accuracy
by 10 percentage points, from 72.3% to 82.3%. This surpassed Gemini 3.1 Pro,
Anthropic's Claude Opus 4.7, and OpenAI's GPT 5.2 High, all much bigger and more expensive
models than Gemini 3.0 Flash.
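A harness that needs only standard API access is, in code terms, a harness that consumes nothing but a text-in/text-out function. The sketch below illustrates why that makes it portable across providers; all names here are our own illustration, not Poetiq's actual interfaces.

```python
from typing import Callable, Optional

# A "model" is reduced to a prompt -> completion function; wrapping any
# provider's chat endpoint in this signature takes a few lines of SDK code.
Model = Callable[[str], str]

def make_harness(prompt_template: str, n_attempts: int, judge) -> Callable:
    """Build the harness once; run it unchanged against any model."""
    def harness(model: Model, problem: str) -> Optional[str]:
        for _ in range(n_attempts):
            code = model(prompt_template.format(problem=problem))
            if judge(code):  # keep the first candidate that passes the tests
                return code
        return None
    return harness
```

Swapping one base model for another then changes only the callable passed in, which is why the same harness can transfer from Gemini 3.1 Pro to GPT 5.5 without re-optimization.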
Because LCB Pro categorizes challenges by difficulty (Easy, Medium, and Hard) based on
competitive human solve rates, it provides a granular view of capability. Table 1 shows
that our optimized harness consistently outperforms the corresponding base models across
every category.
| Model | Overall Accuracy | Hard | Medium | Easy |
| --- | --- | --- | --- | --- |
| Gemini | | | | |
| Gemini 3.1 Pro | 78.6% | 7.7% | 64.9% | 94.8% |
| Gemini 3 Deep Think | 88.8% | 53.8% | 86.0% | 94.8% |
| Poetiq Harness w/ Gemini 3.1 Pro | 90.9% | 58.3% | 87.5% | 96.9% |
| GPT | | | | |
| GPT 5.5 High | 89.6% | 50.0% | 91.1% | 93.8% |
| Poetiq Harness w/ GPT 5.5 High | 93.9% | 75.0% | 92.9% | 96.9% |
Table 1: Poetiq Harness vs. Gemini and GPT models on LCB Pro (25Q2). Reported accuracies
are split according to the benchmark's difficulty categories. The Poetiq Harness improves
on its base model in every category, and with GPT 5.5 High it outperforms all other
models in every category.
To further illustrate the benefits of Poetiq's Meta-System, we apply our technique to
popular recent models, both closed-source and open-weights. The Leap Frog graphs
below show large jumps in performance for two models (Gemini 3.0 Flash and Kimi K2.6); a
summary of the improvements on all models can be found in the Appendix.
Figure 2: Using Poetiq's harness improves all models tested. Here we highlight two
examples: (A) Gemini 3.0 Flash, (B) Kimi K2.6. Note Kimi K2.6's 30% improvement. See the
Appendix for all model improvements.
Why Test on Code?
This is Poetiq's third publicly reported benchmark test. Previously, we showed how we
can improve all models' performance on both ARC-AGI and Humanity's Last Exam (HLE). We
have been strategic in the benchmarks we have attempted; we believe there are
three critical categories of tasks for LLMs:
- Reasoning challenges: these require the LLM to synthesize provided
information in inventive ways; ARC-AGI stands as the premier example of this
ability.
- Retrieval challenges: these quantify the breadth of knowledge embedded
within a model's weights. HLE serves as a rigorous audit of this, requiring models
to recall precise facts across a vast spectrum of disciplines.
- Coding challenges: as the most pervasive commercial application for AI
today, these tasks meld reasoning and retrieval with the generation of specialized
procedural logic. Achieving state-of-the-art results here demonstrates the potential
economic impact of our approach to recursive self-improvement.
Our coding initiatives focused on three primary objectives:
- To prove that by constructing an intelligent harness around any underlying LLM, we
can boost efficacy without fine-tuning or special model access. ✓
- To validate our Meta-System's capacity for recursive self-improvement in creating
this harness. We are proud to state that our system builds and optimizes these
task-specific harnesses fully automatically. ✓
- To demonstrate that once our harness is built, it is model-agnostic and can
be used, without modification, with any model. ✓
SOTA results were achieved using harnesses that were automatically created and optimized
by Poetiq's Meta-System, which in turn refines itself through recursive self-improvement.
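Poetiq has not published the Meta-System's internals, so as a deliberately toy illustration of "automatically created and optimized," here is a greedy hill-climb over harness configurations; `mutate`, `evaluate`, and the config fields are all hypothetical stand-ins.

```python
import random

def meta_optimize(seed_config, mutate, evaluate, rounds=50):
    """Toy self-improvement loop: propose a harness variant, score it on a
    dev set (e.g., accuracy minus runtime/memory penalties), keep if better."""
    best, best_score = seed_config, evaluate(seed_config)
    for _ in range(rounds):
        candidate = mutate(best)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

# Hypothetical usage: tune how many candidate programs to sample per problem.
tuned = meta_optimize(
    seed_config={"n_attempts": 1},
    mutate=lambda c: {"n_attempts": max(1, c["n_attempts"] + random.choice((-1, 1)))},
    evaluate=lambda c: min(c["n_attempts"], 4) - 0.2 * c["n_attempts"],  # toy score
)
```

The toy score rewards extra attempts up to a point and then penalizes them, standing in for the real trade-off between accuracy and compute that any such loop must balance.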
So, What's Next?
At Poetiq, our core Meta-System is engineered to automate the extraction of knowledge
for challenging tasks, producing highly optimized agents, harnesses, and orchestrators.
We optimize every part of the process: developing better strategies for determining what
to ask, refining sequential chains of questions, and devising fundamentally new methods
for assembling the answers. Our Meta-System constantly incorporates its learnings from
previous and current tasks and datasets to automatically create new, custom,
task-specific harnesses.
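As one hypothetical instance of a "sequential chain of questions," an assembled answer might come from an ask/test/re-ask loop like the following; the prompts and control flow are our own sketch, not the Meta-System's actual strategy.

```python
def solve_with_chain(model, problem, judge, max_rounds=3):
    """Ask for a solution, test it, and re-ask with feedback until accepted."""
    code = model(f"Write a C++ solution to this problem:\n{problem}")
    for _ in range(max_rounds):
        if judge(code):
            return code
        # Turn the failure into a follow-up question rather than retrying blind.
        critique = model(f"This C++ solution fails hidden tests. List likely bugs:\n{code}")
        code = model(
            f"Problem:\n{problem}\n\nFailing solution:\n{code}\n\n"
            f"Critique:\n{critique}\n\nWrite a corrected C++ solution."
        )
    return code  # best effort after max_rounds
```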
Since our emergence from stealth in November 2025, we have publicly demonstrated our
approach on benchmarks requiring reasoning, knowledge retrieval, advanced tool use
(e.g., in HLE), and now coding. Each exposed unique hurdles in maximizing LLM
performance. So what's next? We are engaging with a small, select group of early
customers; if you're excited to be one of the first to try Poetiq on your problems, let
us know.
Join the Journey
Poetiq is a lean, deeply technical team with a combined 72 years of experience from
Google/DeepMind. We're focused on solving the fundamental problems of AI reasoning and
knowledge extraction in the presence of noise and uncertainty through recursive
self-improvement. Want to join us?
Check out our open positions.
Appendix
Figure 3: All model improvements on LCB Pro. Notice Kimi's 30% improvement and
Nemotron's 12.8% improvement!
Whenever possible, we report accuracy numbers directly from the LCB Pro leaderboard at
https://livecodebenchpro.com/projects/livecodebench-pro/leaderboard (25Q2). For models
not featured on the leaderboard, we conducted our own evaluations. To validate our
experimental setup, we tested several baseline models and successfully replicated their
official leaderboard accuracies.