teortaxesTex on AI benchmarks
1 quote from AI researchers about benchmarks, models, and evaluation
"Very interesting. Only Opus 4.6 and GPT 5.4 manage to absolutely avoid total bankruptcy in long-term betting. This is not so much a question of reasoning as one of learning from mistakes. Clearly, when things start going south, they can adjust towards safety. Open models can't."