AI benchmarks on Heroes III

Which model masters Might & Magic?

LLMs compete on Heroes of Might & Magic III in two ways — writing bots that play the entire map, and fighting head-to-head mirrored battles. Pick a discipline:

Season 3 · The Film Season

Who writes the best bot?

Seven LLMs. Each gets one agentic session in its own coding CLI, a 300-game practice budget, and the game film — a grep-able per-decision log of every game its bot plays. It reads its own film, fixes its own bugs, and submits one bot. Each model does this twice, independently — two cars, one constructors' team. The bots then fight a 3,640-game round-robin league on held-out maps they never trained on.

Updated July 19, 2026 · 2 runs × 7 models · Reasoning effort: high (Claude) / xhigh (Codex) · 3,640 league games · 0 failed

Standings

3,640 League games on 8 held-out generated maps — every pairing plays 20 seeds from both seats
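The league size follows from the round-robin arithmetic. The sketch below reproduces it, assuming every one of the 14 bots meets every other bot, including the sibling run of the same model (the 3,640 total only works out under that assumption). The function and constant names are illustrative, and how the 20 seeds are spread over the 8 maps is not stated on this page.

```python
from itertools import combinations

MODELS = 7
RUNS_PER_MODEL = 2
SEEDS_PER_PAIRING = 20
SEATS = 2  # each seed is played from both seats


def league_games(bots: int, seeds: int = SEEDS_PER_PAIRING, seats: int = SEATS) -> int:
    """Number of games in a full round-robin with mirrored seats."""
    # Every unordered pair of distinct bots is one pairing.
    pairings = sum(1 for _ in combinations(range(bots), 2))
    return pairings * seeds * seats


bots = MODELS * RUNS_PER_MODEL   # 14 bots
total = league_games(bots)       # 91 pairings x 20 seeds x 2 seats
print(bots, total)               # 14 3640
```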

46 Eliminations — after zero in every previous experiment; one bot went 20 kills / 0 deaths

1.18 Biggest gap between two runs of the same model — same harness, same information, different session

| # | Writer model (CLI model ID · effort) | Combined (mean of two runs) | Run A median margin [95% CI] | Run B median margin [95% CI] | Consistency (absolute A − B gap) | Elims for / against |
|---|---|---|---|---|---|---|
| 1 | GPT-5.6-Sol (codex:gpt-5.6-sol · xhigh) | +0.68 | +0.64 [+0.55, +0.72] | +0.72 [+0.60, +0.82] | 0.08 | 6 / 3 |
| 2 | GPT-5.5 (codex:gpt-5.5 · xhigh) | +0.59 | +0.63 [+0.58, +0.73] | +0.56 [+0.45, +0.64] | 0.07 | 21 / 1 |
| 3 | Claude Opus 4.8 (claude:claude-opus-4-8 · high) | +0.18 | -0.41 [-0.52, -0.24] | +0.78 [+0.65, +0.84] | 1.18* | 7 / 3 |
| 4 | GPT-5.6-Terra (codex:gpt-5.6-terra · xhigh) | +0.17 | -0.08 [-0.17, -0.04] | +0.43 [+0.36, +0.57] | 0.52 | 0 / 3 |
| 5 | GPT-5.6-Luna (codex:gpt-5.6-luna · xhigh) | -0.17 | -0.45 [-0.60, -0.31] | +0.11 [+0.03, +0.25] | 0.56 | 4 / 4 |
| 6 | Claude Sonnet 5 (claude:claude-sonnet-5 · high) | -0.36 | -0.31 [-0.44, -0.17] | -0.41 [-0.51, -0.26] | 0.10 | 1 / 4 |
| 7 | Claude Haiku 4.5 (claude:claude-haiku-4-5 · high) | -4.03 | -3.99 [-4.08, -3.89] | -4.07 [-4.17, -4.01] | 0.08 | 7 / 28 |

* Claude Opus 4.8: anatomy of the 1.18 spread

According to the session transcripts and the league film, the two sessions differed in research discipline, not in tooling. Run A burned a third of its budget on four catastrophic early redesigns (−2.7→−3.8), then film-debugged its way to +0.49 against the practice set, declared itself done, and stopped with 204 of its 300 games unused; its final verification covered only 20 games. Run B's first draft was already positive. It A/B-tested four drafts against its previous version, verified its final bot on 54 games (+1.4), tried one more draft, measured it as worse, and reverted. In the league, A's bot is uniformly mediocre on all 8 maps (no crash, no broken seat, just under-cooked), while B's strength transferred. Same model, same harness, same information; different research discipline between the two sessions. This is not a harness artifact: neither run had any errors or failed games.

One session, one bot

Each run is a single non-interactive agentic session in the model's native CLI (Claude Code at high effort / Codex at xhigh). It gets the rules, a spec, and an eval tool — no human input, no retries. Whatever submission.py holds when the session ends is what enters the league.

The game film

After every eval, per-game digests land in the workspace: one line per decision plus fight, capture and sighting events — everything the bot could legally observe, nothing more. Models debug by grepping their own games; mirror self-play isolates seat bugs on identical deterministic games.

Generated maps, held-out league

Practice runs on 12 randomly generated mirror-symmetric maps (all 8 factions, both players the same faction per map). The league plays 8 different maps from the same generator, sealed by published SHA-256 hashes until the season ended — overfit the practice maps and the league punishes you.
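Sealing maps behind published hashes is a commit-and-reveal scheme: the hash is published before the season, the map file only afterwards, and anyone can check that the two match. A minimal sketch using Python's standard `hashlib` follows. The map bytes are placeholders and the function names are hypothetical; the benchmark's actual file format and tooling are not described here.

```python
import hashlib


def commit(map_bytes: bytes) -> str:
    """Return the SHA-256 hex digest to publish before the season starts."""
    return hashlib.sha256(map_bytes).hexdigest()


def verify(map_bytes: bytes, published_hash: str) -> bool:
    """After the reveal, check that the released map matches the commitment."""
    return hashlib.sha256(map_bytes).hexdigest() == published_hash


# Placeholder content standing in for a real held-out map file.
sealed_map = b"held-out league map 1 (placeholder bytes)"
published = commit(sealed_map)

print(len(published))                       # 64 hex characters
print(verify(sealed_map, published))        # True
print(verify(sealed_map + b"!", published)) # False: any change breaks the match
```

The digest reveals nothing practical about the map contents, so publishing it early does not leak the held-out set, while any later substitution of a map would be detectable.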

The metric

Margin = log-ratio of day-capped peak army, the game's continuous score. Every pairing plays 20 seeds × both seats, so seat advantage cancels by design. Rank = mean of a model's two run medians; eliminations are reported but not scored — yet.
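The scoring pipeline described above can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's code: the natural logarithm is assumed (the base is not given), both peak-army values are assumed strictly positive (how an eliminated side is handled is not described), and all function names are hypothetical.

```python
import math
from statistics import median


def margin(own_peak_army: float, opp_peak_army: float) -> float:
    """Log-ratio of the two day-capped peak armies (natural log assumed)."""
    # Assumption: both inputs are > 0; the source does not say how a
    # wiped-out army is floored.
    return math.log(own_peak_army / opp_peak_army)


def run_median(margins: list[float]) -> float:
    """Score of one run: the median margin over its league games."""
    return median(margins)


def combined_score(run_a_median: float, run_b_median: float) -> float:
    """Ranking score of a model: the mean of its two run medians."""
    return (run_a_median + run_b_median) / 2


def consistency_gap(run_a_median: float, run_b_median: float) -> float:
    """Absolute difference between the two run medians."""
    return abs(run_a_median - run_b_median)


# The margin is antisymmetric: what one bot gains, its opponent loses.
assert math.isclose(margin(1200, 800), -margin(800, 1200))

# Run medians taken from the GPT-5.6-Sol row of the standings.
print(round(combined_score(0.64, 0.72), 2))   # 0.68
print(round(consistency_gap(0.64, 0.72), 2))  # 0.08
```

Recomputing from the rounded table entries can differ from the published figures in the last digit. For Claude Opus 4.8, for example, 0.78 − (−0.41) gives 1.19 against the published 1.18, presumably because the published values derive from unrounded medians.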


[Chart: Combined score of two independent runs]

[Chart: Practice margin vs games spent]
