5 quotes from AI researchers about benchmarks, models, and evaluation
"Humans, I believe, are at 72.4%, whereas GPT-5.4 on OSWorld-Verified is at 75%. And additionally, this model is quite a bit better across a number of different tasks like BrowseComp and WebArena."
"It is one thing to perform a task really well if you brute force it and throw a lot of compute at it, but it is a whole other thing to have a model that is token efficient, because ultimately, even if a model is more expensive per token, if it can perform the task with fewer tokens, it can potentially be a lot cheaper."
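The arithmetic behind that quote can be sketched in a few lines. This is a minimal illustration with hypothetical prices and token counts (none of these numbers come from the quotes): a model with a higher per-token price can still end up cheaper on a task if it finishes in fewer tokens.

```python
def task_cost(price_per_million_tokens: float, tokens_used: int) -> float:
    """Cost in dollars of completing a task, given a per-million-token price."""
    return price_per_million_tokens * tokens_used / 1_000_000

# Hypothetical comparison: an "expensive but token-efficient" model
# versus a "cheap but verbose" one, on the same task.
efficient_model_cost = task_cost(30.0, 50_000)   # $30/M tokens, 50k tokens
verbose_model_cost = task_cost(5.0, 400_000)     # $5/M tokens, 400k tokens

print(f"efficient: ${efficient_model_cost:.2f}")  # $1.50
print(f"verbose:   ${verbose_model_cost:.2f}")    # $2.00
```

Despite a 6x higher per-token price, the token-efficient model is cheaper on this hypothetical task because it uses 8x fewer tokens.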
"GPT-5.4 combines the coding strengths of GPT-5.3-Codex with leading knowledge-work and computer-use capabilities. When we compare it to GPT-5.3-Codex across a number of benchmarks, it is a pretty considerable leap."
"Increasingly, these models are able to perform very real functions in the world. They are able to use browsers. They are able to use computers. They are getting to the point where they are very effective, and you can spawn them off to do a ton of useful work and automate a ton of different things."
"Claude Opus 4.6, the flagship model from Anthropic, is 5 dollars per million input tokens and 25 dollars per million output tokens, whereas GPT-5.4 Pro is 30 dollars per million input tokens and 180 dollars per million output tokens."
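To make the quoted price gap concrete, here is a small sketch that applies both quoted price schedules to the same hypothetical request (the 10,000 input / 2,000 output token counts are assumptions for illustration, not from the quotes):

```python
def request_cost(input_price: float, output_price: float,
                 input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, with prices quoted per million tokens."""
    return (input_price * input_tokens + output_price * output_tokens) / 1_000_000

# Hypothetical request: 10,000 input tokens, 2,000 output tokens.
opus_cost = request_cost(5.0, 25.0, 10_000, 2_000)     # quoted Opus 4.6 prices
gpt_pro_cost = request_cost(30.0, 180.0, 10_000, 2_000)  # quoted GPT-5.4 Pro prices

print(f"Claude Opus 4.6: ${opus_cost:.2f}")   # $0.10
print(f"GPT-5.4 Pro:     ${gpt_pro_cost:.2f}")  # $0.66
```

At these quoted rates, the same request costs roughly 6-7x more on GPT-5.4 Pro, which is why the token-efficiency point in the earlier quote matters for the overall comparison.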