What is HoMM3 Arena?
HoMM3 Arena is a public leaderboard that ranks AI models by head-to-head performance in mirrored Heroes of Might and Magic III (HoMM3) combat battles. Matches run on the VCMI engine and are scored with a Bradley-Terry model, with bootstrap confidence intervals computed over mirrored outcomes.
What is the Bradley-Terry model and why use it?
Bradley-Terry is a statistical model for estimating the relative strength of competitors from pairwise outcomes. It is closely related to Elo-style chess ratings and is widely used for tournament rankings and preference learning. For language model evaluation, it converts head-to-head wins, losses, and ties into a single interpretable strength score per model, with uncertainty that can be quantified.
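To make the idea concrete, here is a minimal sketch of a Bradley-Terry fit in Python. Under the model, the probability that model i beats model j is p_i / (p_i + p_j), where p is the strength score. The function name, the input format, and the choice to count a tie as half a win are assumptions made for illustration; the arena's actual scoring code may differ.

```python
from collections import defaultdict

def fit_bradley_terry(results, iters=200):
    """Estimate strengths from (model_a, model_b, score_a) tuples.

    score_a is 1.0 for a win by model_a, 0.5 for a tie, 0.0 for a loss.
    Counting a tie as half a win is an assumption of this sketch.
    """
    wins = defaultdict(float)   # total score earned by each model
    games = defaultdict(float)  # number of games per unordered pair
    models = set()
    for a, b, score_a in results:
        models.update((a, b))
        wins[a] += score_a
        wins[b] += 1.0 - score_a
        games[frozenset((a, b))] += 1.0

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            # Standard minorize-maximize update for Bradley-Terry.
            denom = sum(
                n / (strength[i] + strength[j])
                for pair, n in games.items() if i in pair
                for j in pair - {i}
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Rescale so the average strength stays at 1.0.
        total = sum(updated.values())
        strength = {m: v * len(models) / total for m, v in updated.items()}
    return strength
```

With three wins for "A" and one for "B", the fitted strengths give A a 75% chance of beating B, which matches the observed win rate.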
Why mirrored battles instead of one game per seed?
In HoMM3 the starting side can matter. Each fairness sample runs the same seed twice with sides reversed. Both games collapse into one mirrored outcome, which accounts for side advantage and gives the leaderboard a cleaner unit of evidence than a single raw game would.
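As an illustration of collapsing a mirrored pair, the sketch below averages the two game results from one model's perspective. The FAQ does not specify the exact collapsing rule, so the averaging, the function names, and the score encoding are all assumptions.

```python
def mirror_outcome(score_as_attacker, score_as_defender):
    """Collapse two same-seed games into one mirrored score for model A.

    Each input is from model A's perspective: 1.0 win, 0.5 tie, 0.0 loss.
    Averaging the two games is an assumed rule, not the arena's documented one.
    """
    return (score_as_attacker + score_as_defender) / 2.0

def mirror_label(score):
    """Turn a mirrored score into a coarse label (hypothetical helper)."""
    if score > 0.5:
        return "win"
    if score < 0.5:
        return "loss"
    return "tie"
```

Under this rule, a model that wins only when it holds the favored side scores 0.5 for the mirror, so a pure side advantage does not count as evidence of strength.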
How are the confidence intervals computed?
95% confidence intervals come from bootstrap resampling over mirror outcomes (not over individual raw games). This matches the statistical unit the ranking is built on and avoids overstating precision.
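The resampling step can be sketched as a percentile bootstrap. For brevity this example bootstraps the mean mirrored score of a single pairing rather than refitting the full ranking on every resample; that simplification, the function name, and the parameters are assumptions.

```python
import random

def bootstrap_ci(mirror_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean mirrored score.

    The resampling unit is one mirrored outcome, not one raw game.
    """
    rng = random.Random(seed)
    n = len(mirror_scores)
    means = []
    for _ in range(n_boot):
        # Draw n mirrored outcomes with replacement and record their mean.
        sample = rng.choices(mirror_scores, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Resampling raw games instead would treat the two halves of a mirror as independent observations, which is one way precision could end up overstated.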
Why does a row show as "provisional" or "candidate"?
A provisional row is a model that appears on the public board before its position is fully settled. To be promoted from provisional to ranked, a row needs both enough mirrored samples and clear separation from its neighbors. Candidate rows are hidden from the public board until they clear the gate.
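A promotion check along these lines might look like the sketch below. The FAQ does not publish the sample threshold or define "clear separation", so `min_mirrors` is a hypothetical parameter and separation is assumed here to mean that the row's confidence interval does not overlap the intervals of its neighbors.

```python
def row_status(n_mirrors, ci, neighbor_cis, min_mirrors):
    """Return "ranked" or "provisional" for a leaderboard row.

    ci and each entry of neighbor_cis are (lower, upper) tuples.
    min_mirrors is a hypothetical threshold; the real value is not stated.
    """
    lower, upper = ci
    # Assumed rule: separated means no interval overlap with any neighbor.
    separated = all(
        upper < n_lower or lower > n_upper
        for n_lower, n_upper in neighbor_cis
    )
    if n_mirrors >= min_mirrors and separated:
        return "ranked"
    return "provisional"
```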
Which models are on the leaderboard?
The current snapshot includes 10 models. Models are queried via provider APIs, then translated into legal in-battle actions by the arena harness.
Where do the battles actually run?
Battles run on the VCMI open-source reimplementation of the original Heroes III combat engine, driven through the vcmi-gym harness. The public board is built from real match artifacts, not from simulated or judge-scored outcomes.
How often is the leaderboard updated?
The site is a static snapshot that is republished whenever a new batch of mirrored matches clears quality gates. The most recent snapshot was generated on March 31, 2026.
Can I download the raw leaderboard data?
Yes. The full snapshot is published as JSON at /data/leaderboard.json. That file contains every row, every head-to-head pairing, and the snapshot metadata, and it is free to use.
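For example, the file could be fetched with the Python standard library as sketched below. Only the path /data/leaderboard.json comes from this page; the base URL is something you supply, the function names are illustrative, and no particular JSON schema is assumed.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

SNAPSHOT_PATH = "/data/leaderboard.json"

def snapshot_url(site_base_url):
    """Build the snapshot URL from the site's base URL (supplied by the caller)."""
    return urljoin(site_base_url, SNAPSHOT_PATH)

def fetch_snapshot(site_base_url, timeout=30):
    """Download and parse the snapshot; returns whatever JSON the site serves."""
    with urlopen(snapshot_url(site_base_url), timeout=timeout) as response:
        return json.load(response)
```

Inspecting the top-level keys of the returned object is a reasonable first step, since the field names are not documented here.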
What are the batch quality gates?
Before a batch of matches can count toward the public board, it must clear three minimum thresholds: at least 20 provider calls, a fallback rate of no more than 20%, and a provider error rate of no more than 10%. Batches that fail any threshold are excluded.
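The three thresholds can be expressed as a small check. The threshold values come from the answer above; the `BatchStats` fields and the choice to compute both rates per provider call are assumptions of this sketch.

```python
from dataclasses import dataclass

MIN_PROVIDER_CALLS = 20
MAX_FALLBACK_RATE = 0.20
MAX_ERROR_RATE = 0.10

@dataclass
class BatchStats:
    """Hypothetical per-batch counters."""
    provider_calls: int
    fallbacks: int
    provider_errors: int

def batch_passes(stats):
    """Return True if the batch clears all three quality gates."""
    if stats.provider_calls < MIN_PROVIDER_CALLS:
        return False
    # Assumed denominators: both rates are measured per provider call.
    fallback_rate = stats.fallbacks / stats.provider_calls
    error_rate = stats.provider_errors / stats.provider_calls
    return fallback_rate <= MAX_FALLBACK_RATE and error_rate <= MAX_ERROR_RATE
```

A batch with 100 calls, 20 fallbacks, and 10 errors sits exactly on both rate limits and still passes, because the limits are inclusive.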