WolfBench (2026-03-05)

Wolfram Ravenwolf’s Four-Metric Framework · based on Terminal-Bench 2.0

One score is not enough.
Because performance is a distribution, not a point.

Most benchmarks report just a single average. WolfBench shows four metrics: the rock-solid base you can always count on, the average you can expect, the best a single run achieved, and the ceiling of what’s theoretically possible. The spread between them tells you how consistent – or how unpredictable – an AI agent really is.
Learn more ↓

Legend: T2 = Terminus-2 · CC = Claude Code · OC = OpenClaw
Metrics: ■ Solid (always solved) · ∅ Average (mean score) · ★ Best-of (peak run) · ▲ Ceiling (ever solved)

Model              Harness   Runs   ■ Solid   ∅ Average   ★ Best-of   ▲ Ceiling
Claude Opus 4.6    T2        5     55%       73%         75%         88%
Claude Opus 4.6    CC        5     46%       64%         69%         80%
Claude Opus 4.6    OC        5     33%       51%         56%         64%
Claude Sonnet 4.6  T2        5     42%       61%         63%         81%
Claude Sonnet 4.6  CC        5     37%       56%         60%         75%
Claude Sonnet 4.6  OC        10    29%       53%         58%         74%
GLM-5 FP8          T2        1     51%       51%         51%         51%
Kimi K2.5          T2        6     26%       49%         52%         66%
Kimi K2.5          OC        5     10%       32%         35%         57%
MiniMax M2.5       T2        5     27%       47%         51%         64%
MiniMax M2.5       OC        5     0%        36%         45%         60%

About WolfBench

by Wolfram Ravenwolf – who evaluates models for breakfast, builds agents at night, and preaches AI usefulness all day long.

AI agents are becoming essential tools. Every week, a new model comes out and claims to be “the best at coding” or “SOTA on agentic tasks.” But what does that actually mean for you – the person who’s going to throw real work at these things?

A single score tells you almost nothing.

Most benchmarks give you one number: “Model X scored 42% on Benchmark Y.” Great. But can you rely on it? Was that a lucky run? Would it score the same tomorrow? What’s the floor – the tasks it always nails? What’s the ceiling – what it could do if the stars align?

WolfBench exists because I got tired of incomplete leaderboards. I wanted to know which model, which harness, and which settings actually deliver the best results on real agentic tasks – not just on paper, but in practice, consistently, across multiple runs.

What is it?

WolfBench is my evaluation framework built on top of Terminal-Bench 2.0, a popular agentic benchmark consisting of 89 diverse real-world tasks. These aren’t just coding puzzles – they span the kind of work you’d actually ask an AI agent to do.

The key word is agentic: these tasks require the model to plan, execute shell commands, inspect results, debug failures, and iterate – just like a human developer or sysadmin would. No multiple-choice shortcuts. No toy puzzles. Real work in real sandboxed environments.

What makes WolfBench different?

The Four-Metric Framework

Performance is a distribution, not a point. One number can’t capture what an AI agent is truly capable of. Four numbers get a lot closer.

▲ Ceiling: What’s theoretically possible?

The union of all tasks ever solved across all runs. If the model solved task A in run 3 and task B in run 5 (but never both in the same run), both count toward the ceiling.

It tells you the theoretical maximum performance this model is capable of within a given harness and settings – even if no single run achieves it. It reveals variance-limited tasks: solvable, but not reliably.
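The union semantics can be illustrated with a minimal sketch (the task names and run data are hypothetical, not actual WolfBench results):

```python
# Each run is modeled as the set of tasks it solved (hypothetical data).
runs = [
    {"task_a", "task_c"},  # run 1 solved A and C
    {"task_b", "task_c"},  # run 2 solved B and C
]

# Ceiling: every task solved in at least one run.
ceiling = set().union(*runs)
print(sorted(ceiling))  # ['task_a', 'task_b', 'task_c']
```

No single run solved both task_a and task_b, yet both count toward the ceiling – exactly the variance-limited case described above.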

★ Best-of: What’s the peak in a single run?

The highest score from any individual run.

This is the “marketing number” – but with context. The closer the best-of is to the average, the more consistently the model performs. A large gap between best-of and average means you’re rolling dice every time you run it.

∅ Average: What can you normally expect?

The mean score across all valid runs (e.g., 5 or more replicates).

This is the most commonly reported metric – and it is useful, but only with enough runs to be stable. With a single run? It’s a coin flip.

■ Solid: What does it always get right?

Tasks that the model solves across all runs – the rock-solid base with zero variance.

The higher the solid base, the more dependable the agent is. These are the tasks you can confidently delegate and expect success every time. A model with a high solid base and moderate average is often more reliable in practice than one with a high average but low solid base – because you know what you’re getting.
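Putting the four metrics together, here is a hedged sketch of how such scores could be computed from per-run solved-task sets. The run data and the 10-task suite are invented for illustration (Terminal-Bench 2.0 itself has 89 tasks):

```python
# Hypothetical: 3 runs over a 10-task suite; each set holds the tasks that run solved.
TOTAL_TASKS = 10
runs = [
    {"t1", "t2", "t3", "t4", "t5"},        # run 1: 50%
    {"t1", "t2", "t3", "t6"},              # run 2: 40%
    {"t1", "t2", "t4", "t5", "t6", "t7"},  # run 3: 60%
]

def pct(n: int) -> float:
    return 100 * n / TOTAL_TASKS

solid   = pct(len(set.intersection(*runs)))           # ■ solved in every run
average = sum(pct(len(r)) for r in runs) / len(runs)  # ∅ mean run score
best_of = max(pct(len(r)) for r in runs)              # ★ peak single run
ceiling = pct(len(set.union(*runs)))                  # ▲ solved in any run

print(solid, average, best_of, ceiling)  # 20.0 50.0 60.0 70.0
```

Note the ordering solid ≤ average ≤ best-of ≤ ceiling always holds; the size of each gap is what the framework is designed to expose.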

Reading the Chart

The four metrics stack vertically for each model/configuration. The spread between them tells you as much as the numbers themselves: a tight stack means consistent, predictable performance, while a wide one means you’re rolling dice with every run.

The Bottom Line

Performance is more complex than a single average score – and the decisions you make based on benchmarks deserve better data than that. WolfBench gives you four angles on every model and configuration, so you can form a more complete and realistic judgement of what an AI agent will actually deliver when you put it to work.

Because at the end of the day, you don’t just want to know which model scored the highest. You want to know which one you can trust.

What’s Next

I will continuously add models and agents to the chart, publish the traces and evals on W&B Weave, and release regular blog posts detailing interesting and insightful findings.

This benchmark offers enormous potential for discovery. For instance: Why does Sonnet currently outperform Opus with OpenClaw? How does Claude Code fare when running a GPT or Gemini model compared to running directly with Opus or Sonnet – or Codex with Claude or Gemini? Is a “cheap” model actually cost-effective if it consumes far more tokens than a more expensive alternative? How does quantization affect the performance of local models on agentic tasks?

So many possibilities for analysis – and for posting about it! Stay tuned – and if you want to be the first to know when new results come in, follow me on X and LinkedIn.

Inference sponsored by CoreWeave: The Essential Cloud for AI.
Sandbox compute by Daytona – Secure Infrastructure for Running AI-Generated Code.
Built with Harbor for orchestration, Terminal-Bench 2.0 for tasks, and W&B Weave for tracking.
Charts and dashboards generated with marimo notebooks.