WolfBench (2026-03-05)

Wolfram Ravenwolf’s Four-Metric Framework · based on Terminal-Bench 2.0

One score is not enough.
Because performance is a distribution, not a point.

Most benchmarks report just a single average. WolfBench shows four metrics: the rock-solid base you can always count on, the average you can expect, the best a single run achieved, and the ceiling of what’s theoretically possible. The spread between them tells you how consistent – or how unpredictable – an AI agent really is.
Learn more ↓

Legend: T2 = Terminus-2 · CC = Claude Code · OC = OpenClaw
Metrics: ■ Solid (always solved) · ∅ Average (mean score) · ★ Best-of (peak run) · ▲ Ceiling (ever solved)

Model              Harness   Runs   ■ Solid   ∅ Average   ★ Best-of   ▲ Ceiling
Claude Opus 4.6    T2        5     55%       73%         75%         88%
Claude Opus 4.6    CC        5     46%       64%         69%         80%
Claude Opus 4.6    OC        5     33%       51%         56%         64%
Claude Sonnet 4.6  T2        5     42%       61%         63%         81%
Claude Sonnet 4.6  CC        5     37%       56%         60%         75%
Claude Sonnet 4.6  OC        10    29%       53%         58%         74%
GLM-5 FP8          T2        1     51%       51%         51%         51%
Kimi K2.5          T2        6     26%       49%         52%         66%
Kimi K2.5          OC        5     10%       32%         35%         57%
MiniMax M2.5       T2        5     27%       47%         51%         64%
MiniMax M2.5       OC        5     0%        36%         45%         60%

About WolfBench

by Wolfram Ravenwolf – who evaluates models for breakfast, builds agents at night, and preaches AI usefulness all day long.

AI agents are becoming essential tools. Every week, a new model comes out and claims to be “the best at coding” or “SOTA on agentic tasks.” But what does that actually mean for you – the person who’s going to throw real work at these things?

A single score tells you almost nothing.

Most benchmarks give you one number: “Model X scored 42% on Benchmark Y.” Great. But can you rely on it? Was that a lucky run? Would it score the same tomorrow? What’s the floor – the tasks it always nails? What’s the ceiling – what it could do if the stars align?

WolfBench exists because I got tired of incomplete leaderboards. I wanted to know which model, which harness, and which settings actually deliver the best results on real agentic tasks – not just on paper, but in practice, consistently, across multiple runs.

What is it?

WolfBench is my evaluation framework built on top of Terminal-Bench 2.0, a popular agentic benchmark consisting of 89 diverse real-world tasks. These aren’t just coding puzzles – they span the kind of work you’d actually ask an AI agent to do.

The key word is agentic: these tasks require the model to plan, execute shell commands, inspect results, debug failures, and iterate – just like a human developer or sysadmin would. No multiple-choice shortcuts. No toy puzzles. Real work in real sandboxed environments.

What makes WolfBench different?

The Four-Metric Framework

Performance is a distribution, not a point. One number can’t capture what an AI agent is truly capable of. Four numbers get a lot closer.

▲ Ceiling: What’s theoretically possible?

The union of all tasks ever solved across all runs. If the model solved task A in run 3 and task B in run 5 (but never both in the same run), both count toward the ceiling.

It tells you the theoretical maximum performance this model is capable of within a given harness and settings – even if no single run achieves it. It reveals variance-limited tasks: solvable, but not reliably.
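The union semantics can be illustrated with a minimal sketch (the task names and run data are hypothetical, not actual WolfBench results):

```python
# Each run is modeled as the set of tasks it solved (hypothetical data).
runs = [
    {"task_a", "task_c"},  # run 1 solved A and C
    {"task_b", "task_c"},  # run 2 solved B and C
]

# Ceiling: every task solved in at least one run.
ceiling = set().union(*runs)
print(sorted(ceiling))  # ['task_a', 'task_b', 'task_c']
```

No single run solved both task_a and task_b, yet both count toward the ceiling – exactly the variance-limited case described above.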

★ Best-of: What’s the peak in a single run?

The highest score from any individual run.

This is the “marketing number” – but with context. The closer the best-of is to the average, the more consistently the model performs. A large gap between best-of and average means you’re rolling dice every time you run it.

∅ Average: What can you normally expect?

The mean score across all valid runs (e.g., 5 or more replicates).

This is the most commonly reported metric – and it is useful, but only with enough runs to be stable. With a single run? It’s a coin flip.

■ Solid: What does it always get right?

Tasks that the model solves across all runs – the rock-solid base with zero variance.

The higher the solid base, the more dependable the agent is. These are the tasks you can confidently delegate and expect success every time. A model with a high solid base and moderate average is often more reliable in practice than one with a high average but low solid base – because you know what you’re getting.
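Putting the four metrics together, here is a hedged sketch of how such scores could be computed from per-run solved-task sets. The run data and the 10-task suite are invented for illustration (Terminal-Bench 2.0 itself has 89 tasks):

```python
# Hypothetical: 3 runs over a 10-task suite; each set holds the tasks that run solved.
TOTAL_TASKS = 10
runs = [
    {"t1", "t2", "t3", "t4", "t5"},        # run 1: 50%
    {"t1", "t2", "t3", "t6"},              # run 2: 40%
    {"t1", "t2", "t4", "t5", "t6", "t7"},  # run 3: 60%
]

def pct(n: int) -> float:
    return 100 * n / TOTAL_TASKS

solid   = pct(len(set.intersection(*runs)))           # ■ solved in every run
average = sum(pct(len(r)) for r in runs) / len(runs)  # ∅ mean run score
best_of = max(pct(len(r)) for r in runs)              # ★ peak single run
ceiling = pct(len(set.union(*runs)))                  # ▲ solved in any run

print(solid, average, best_of, ceiling)  # 20.0 50.0 60.0 70.0
```

Note the ordering solid ≤ average ≤ best-of ≤ ceiling always holds; the size of each gap is what the framework is designed to expose.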

Reading the Chart

The four metrics stack vertically for each model/configuration. The spread between them tells you as much as the numbers themselves: a tight stack means consistent, predictable performance, while a wide one means you’re rolling dice with every run.

The Bottom Line

Performance is more complex than a single average score – and the decisions you make based on benchmarks deserve better data than that. WolfBench gives you four angles on every model and configuration, so you can form a more complete and realistic judgement of what an AI agent will actually deliver when you put it to work.

Because at the end of the day, you don’t just want to know which model scored the highest. You want to know which one you can trust.

What’s Next

I will continuously add models and agents to the chart, publish the traces and evals on W&B Weave, and release regular blog posts detailing interesting and insightful findings.

This benchmark offers enormous potential for discovery. For instance: Why does Sonnet currently outperform Opus with OpenClaw? How does Claude Code fare when running a GPT or Gemini model compared to running directly with Opus or Sonnet – or Codex with Claude or Gemini? Is a “cheap” model actually cost-effective if it consumes far more tokens than a more expensive alternative? How does quantization affect the performance of local models on agentic tasks?

So many possibilities for analysis – and for posting about it! Stay tuned – and if you want to be the first to know when new results come in, follow me on X and LinkedIn.

Inference sponsored by CoreWeave: The Essential Cloud for AI.
Sandbox compute by Daytona – Secure Infrastructure for Running AI-Generated Code.
Built with Harbor for orchestration, Terminal-Bench 2.0 for tasks, and W&B Weave for tracking.
Charts and dashboards generated with marimo notebooks.