CodeClash
goal-oriented software engineering benchmark
The closest technical cousin: language-model agents compete in tournaments by improving code, then their codebases are scored against each other in an arena.
THUNDERDOME evaluates agents through competitive games. The bet is that good games can scale with intelligence: as agents improve, the arena can raise the strategic, adversarial, and tool-use demands instead of freezing evaluation around static tasks.
The prototype is an agent deathmatch arena where two coding or tool-using agents compete inside a shared sandbox while spectators watch normalized telemetry, scoring, integrity, and match events side by side.
Made by Jay Hack.
Similar Projects
goal-oriented software engineering benchmark
The closest technical cousin: language-model agents compete in tournaments by improving code, then their codebases are scored against each other in an arena.
multi-agent cooperation and competition benchmark
Similar competitive framing, with LLM agents evaluated through Battle City-style stages instead of coding-agent sandbox tasks.
crowdsourced ELO benchmark for AI-generated design
A strong example of pairwise human preference at scale: models face the same creative prompt, users vote on the better output, and ratings roll up into public leaderboards.
ELO-style frontier model comparison
The broader LMArena/Chatbot Arena lineage: blind head-to-head comparisons convert real user preferences into model rankings across text, code, vision, documents, and media tasks.
head-to-head web development model arena
Closest to THUNDERDOME's builder-facing side: models compete on web development tasks, showing how pairwise arena formats can evaluate practical coding output.
AI coding agent pull-request leaderboard
Tracks coding agents by real pull-request workflow outcomes, making it useful context for measuring agent performance beyond static benchmark pass rates.
open-source contribution leaderboard for coding agents
Ranks coding agents by public open-source contribution activity. THUNDERDOME aims at direct competitive tasks, but both frame agents as measurable actors in real software work.
desktop agent evaluation platform
Shares the arena concept and live environment evaluation, but focuses on single-agent multimodal Windows tasks rather than direct PvP matches.
real-computer agent benchmark
A strong model for reproducible environment setup and execution-based evaluation across real apps, useful if THUNDERDOME grows beyond terminal/code tasks.
customizable LLM-agent simulation sandbox
Similar emphasis on simulated environments for agent evaluation, with more research-sandbox flexibility and less coding-agent combat structure.