"SWE-bench Verified, SWE-bench Pro, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, CAR-bench. All eight — broken with our working exploits run through the official evaluation pipelines."
Dawn Song
OSWorld