Benchmark protocol
Agents write policies, not action traces.
A run starts with a task contract, a policy workspace, and a fixed episode budget. The agent reads visible train feedback, edits the executable policy system, submits checkpoints, and continues from its own accepted workspace state.
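The loop described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's harness: `Run`, `train_feedback`, `submit_checkpoint`, and `toy_evaluate` are hypothetical names, and the toy scoring function stands in for real environment rollouts.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Workspace = Dict[str, str]  # hypothetical: policy file path -> source text


@dataclass
class Run:
    """Illustrative run state; none of these names come from the benchmark."""
    episode_budget: int
    workspace: Workspace
    episodes_used: int = 0
    checkpoints: List[Workspace] = field(default_factory=list)

    def train_feedback(self, episodes: int,
                       evaluate: Callable[[Workspace], float]) -> float:
        """Spend part of the fixed budget to score the current workspace."""
        if self.episodes_used + episodes > self.episode_budget:
            raise RuntimeError("episode budget exhausted")
        self.episodes_used += episodes
        return evaluate(self.workspace)

    def submit_checkpoint(self) -> int:
        """Snapshot the workspace and return the checkpoint index."""
        self.checkpoints.append(dict(self.workspace))
        return len(self.checkpoints) - 1


def toy_evaluate(workspace: Workspace) -> float:
    """Stand-in for environment rollouts: longer policy text scores higher."""
    return float(len(workspace["policy.py"]))


run = Run(episode_budget=10, workspace={"policy.py": "act = 0"})
before = run.train_feedback(4, toy_evaluate)   # read visible train feedback
run.workspace["policy.py"] = "act = obs > 0"   # edit the policy, not a trace
first = run.submit_checkpoint()                # snapshot becomes checkpoint 0
after = run.train_feedback(4, toy_evaluate)    # keep working from the same state
```

In this sketch the agent only ever changes the files in `workspace`; the budget check is what keeps the number of train episodes fixed regardless of how many edits or checkpoints the agent makes.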
Final policy gallery
One environment at a time, four agents side by side.
The gallery is organized by environment. Each row compares the final selected checkpoints of four agents: GPT-5.5, run through Codex, and Claude Opus 4.7, MiniMax-M3, and DeepSeek-V4-Pro, each run through the Claude Code-compatible harness.
The clips are deliberately played slower than the raw rollout cadence so that qualitative behavior is readable. Locomotion, traffic, manipulation, navigation, and failure modes all remain visible; the intermediate optimization process is not shown.
Reported scores
Validation-selected held-out return.
Scores are raw environment returns from the selected final checkpoint. Higher is better within each environment; reward scales are not comparable across columns.
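One way to read "validation-selected held-out return" is sketched below: the checkpoint is chosen by its validation return, and only that checkpoint's held-out return is reported. The `Checkpoint` fields, the function names, and all numbers are placeholders for illustration, not benchmark data or the benchmark's actual scoring code.

```python
from typing import Dict, List, NamedTuple


class Checkpoint(NamedTuple):
    name: str
    validation_return: float  # used only to choose the checkpoint
    heldout_return: float     # the number that gets reported


def reported_score(checkpoints: List[Checkpoint]) -> float:
    """Pick the checkpoint with the best validation return, report held-out."""
    selected = max(checkpoints, key=lambda c: c.validation_return)
    return selected.heldout_return


def rank_within_environment(scores: Dict[str, float]) -> List[str]:
    """Order agents by raw return, highest first, for a single environment."""
    return sorted(scores, key=scores.get, reverse=True)


# Placeholder numbers, not benchmark results.
history = [
    Checkpoint("ckpt-0", validation_return=10.0, heldout_return=9.0),
    Checkpoint("ckpt-1", validation_return=14.0, heldout_return=12.5),
    Checkpoint("ckpt-2", validation_return=13.0, heldout_return=13.5),
]
# ckpt-1 is selected on validation, even though ckpt-2 did better held-out.
score = reported_score(history)
ranking = rank_within_environment(
    {"agent_a": 12.5, "agent_b": -3.0, "agent_c": 40.0}
)
```

The ranking helper deliberately takes the scores for one environment at a time. Because returns are raw and reward scales differ, averaging or ranking across environments would need a normalization step that this sketch does not attempt.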