Four quotes from AI commentary about benchmarks, models, and evaluation
"75% - that is the score a new AI agent called GPT 5.4 just hit on the OS World benchmark. And look at that bar right next to it. That is the average human score sitting at 72.4%. The AI did not just match us, it actually beat us."
"It is a super robust simulation of the real everyday stuff that millions of us do at our computers every single day. We are not talking theory here. This is literally a simulation of professional white collar work. And the AI that pulled this off, it is not your standard chatbot. We are talking about GPT 5.4. It has got this massive 1 million token context window."
"The moment an AI can score higher than a person on real world tasks and do them on its own, well, we are not just getting close to that cliff anymore, we are standing right on the edge looking down."
"These models, GPT 5.4, Claude, Gemini, they are already available. So the advantage, the moat, it is not about having access to the tech. No, the real moat is implementation."