EvoPolicyGym | Final Policy Rollouts

Benchmark protocol

Agents write policies, not action traces.

A run starts with a task contract, a policy workspace, and a fixed episode budget. The agent reads visible train feedback, edits the executable policy system, submits checkpoints, and continues from its own accepted workspace state.

Online loop: edit code, submit, inspect visible train feedback

Selection: checkpoint chosen by hidden validation

Measurement: final score is hidden held-out mean return

Evidence: videos rerun validation-selected checkpoints in the original environment
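
The selection and measurement steps can be read as a two-stage rule. Below is a minimal sketch of that rule, not the benchmark's actual implementation. It assumes each submitted checkpoint has a list of hidden validation returns and a list of hidden held-out returns, and that "chosen by hidden validation" means the highest mean validation return. The function name, field names, and numbers are hypothetical.

```python
from statistics import mean


def select_and_score(checkpoints):
    """Pick one checkpoint by validation, then score it on held-out episodes.

    `checkpoints` is a list of dicts with hypothetical keys:
    'id', 'val_returns', and 'heldout_returns'.
    """
    # Selection: assumed here to be the highest mean hidden-validation return.
    best = max(checkpoints, key=lambda c: mean(c["val_returns"]))
    # Measurement: the final score is the mean return on held-out episodes.
    return best["id"], mean(best["heldout_returns"])


# Illustrative numbers only; they are not benchmark results.
checkpoints = [
    {"id": "ckpt-1", "val_returns": [1.0, 3.0], "heldout_returns": [2.0, 2.0]},
    {"id": "ckpt-2", "val_returns": [4.0, 4.0], "heldout_returns": [3.0, 5.0]},
    {"id": "ckpt-3", "val_returns": [2.0, 5.0], "heldout_returns": [9.0, 9.0]},
]
selected_id, final_score = select_and_score(checkpoints)
```

In this toy data, ckpt-3 has the best held-out return but is not selected, because held-out returns never influence which checkpoint is chosen.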

Final policy gallery

One environment at a time, four agents side by side.

The gallery is organized by environment. Each row compares the final selected checkpoints of four agents: GPT-5.5, run through Codex, and Claude Opus 4.7, MiniMax-M3, and DeepSeek-V4-Pro, each run through the Claude Code-compatible harness.

The clips are intentionally slower than the raw rollout cadence so qualitative behavior is readable: locomotion, traffic, manipulation, navigation, and failure modes remain visible without showing the intermediate optimization process.

Reported scores

Validation-selected held-out return.

Scores are raw environment returns from the selected final checkpoint. Higher is better within each environment; reward scales are not comparable across columns.
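
Because reward scales differ between environments, one safe way to compare agents is to rank them within each environment and never average raw returns across environments. A small sketch of that idea follows. The environment names, agent names, and values are placeholders rather than reported scores, and the helper function is hypothetical.

```python
def rank_within_environment(scores):
    """Return agents ordered best-first, separately for each environment.

    `scores` maps environment -> {agent: raw return}. Higher is better
    within an environment; values are never compared across environments.
    """
    return {
        env: sorted(by_agent, key=by_agent.get, reverse=True)
        for env, by_agent in scores.items()
    }


# Placeholder data on deliberately different reward scales.
scores = {
    "env_a": {"agent_x": 1200.0, "agent_y": 950.0},
    "env_b": {"agent_x": -3.5, "agent_y": -1.25},
}
ranks = rank_within_environment(scores)
```

Here agent_x leads in env_a and agent_y leads in env_b, even though the raw numbers in the two environments differ by orders of magnitude.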