Paper companion

EvoPolicyGym

A benchmark for studying whether coding agents can iteratively improve executable policies through environment feedback.

This companion page collects the final validation-selected Core16 policies, their original-environment reruns, and the held-out scores reported in the paper.

Benchmark protocol

Agents write policies, not action traces.

A run starts with a task contract, a policy workspace, and a fixed episode budget. The agent reads visible train feedback, edits the executable policy system, submits checkpoints, and continues from its own accepted workspace state.

Online loop: edit code, submit, inspect visible train feedback
Selection: checkpoint chosen by hidden validation
Measurement: final score is hidden held-out mean return
Evidence: videos rerun validation-selected checkpoints in the original environment
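
The selection and measurement steps above can be illustrated with a minimal sketch. It is not the benchmark's implementation: the `Checkpoint` class, the `select_and_score` function, the checkpoint names, and all return values are hypothetical, and tie-breaking (first maximum wins here) is an assumption. The point it shows is that the checkpoint is chosen by hidden validation return, and only that checkpoint's held-out mean return is reported.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Checkpoint:
    """A submitted policy checkpoint with hidden evaluation returns (illustrative)."""
    name: str
    validation_returns: list[float]  # hidden from the agent, used only for selection
    heldout_returns: list[float]     # hidden from the agent, used only for the final score


def select_and_score(checkpoints: list[Checkpoint]) -> tuple[str, float]:
    """Pick the checkpoint with the highest validation mean; report its held-out mean."""
    best = max(checkpoints, key=lambda c: mean(c.validation_returns))
    return best.name, mean(best.heldout_returns)


# Invented numbers: ckpt-3 has the best held-out mean but is not selected,
# because selection never looks at held-out returns.
checkpoints = [
    Checkpoint("ckpt-1", [10.0, 12.0], [9.0, 11.0]),
    Checkpoint("ckpt-2", [15.0, 17.0], [13.0, 15.0]),
    Checkpoint("ckpt-3", [14.0, 14.0], [16.0, 18.0]),
]

selected_name, final_score = select_and_score(checkpoints)
print(selected_name, final_score)  # ckpt-2 14.0
```

Keeping selection and measurement on separate hidden episode sets means the reported score is not the same quantity the checkpoint was chosen to maximize.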

Final policy gallery

One environment at a time, four agents side by side.

The gallery is organized by environment. Each row compares the final selected checkpoint from four agents: GPT-5.5 run through Codex, and Claude Opus 4.7, MiniMax-M3, and DeepSeek-V4-Pro run through the Claude Code-compatible harness.

The clips are intentionally slower than the raw rollout cadence so qualitative behavior is readable: locomotion, traffic, manipulation, navigation, and failure modes remain visible without showing the intermediate optimization process.

Reported scores

Validation-selected held-out return.

Scores are raw environment returns from the selected final checkpoint. Higher is better within each environment; reward scales are not comparable across columns.
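
Because reward scales differ between environments, one way to read a score table is to compare agents only within a column. The sketch below ranks agents inside each environment instead of averaging raw returns across environments. It is an illustration under stated assumptions, not the paper's aggregation method: the environment names, agent names, and return values are placeholders, and ties are not handled.

```python
# Placeholder returns, scores[environment][agent]. All values are invented.
scores = {
    "env_a": {"agent_1": 950.0, "agent_2": 1200.0, "agent_3": 700.0},
    "env_b": {"agent_1": -3.5, "agent_2": -8.0, "agent_3": -1.0},
}


def rank_within_environment(env_scores: dict[str, float]) -> dict[str, int]:
    """Rank agents by return within one environment (1 = highest return)."""
    ordered = sorted(env_scores, key=env_scores.get, reverse=True)
    return {agent: position for position, agent in enumerate(ordered, start=1)}


# Ranks are computed per environment, so the different reward scales never mix.
ranks = {env: rank_within_environment(s) for env, s in scores.items()}
for env, env_ranks in ranks.items():
    print(env, env_ranks)
```

Averaging the raw numbers in this example would let `env_a` dominate simply because its returns are larger in magnitude, which is the comparison the note above warns against.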