latentspace on AI benchmarks
19 quotes from AI researchers about benchmarks, models, and evaluation
"The problem with the live demo is, I guess, that Opus 4.5 is not super fast with OpenClaw. So let me see if I can... is it a separate channel per project?"
David Guttman @latentspace · 2026-03-18 view on x
"I don't know if this is really something that people don't know, but Claude Code will totally lie. And so a big part of OpenClaw here is to make sure that Claude Code is doing what it's supposed to be doing."
David Guttman @latentspace · 2026-03-18 view on x
"Claude Code totally cheated. It knew what it had spoken and it just short-circuited the test. And so when OpenClaw went to check, it did its own test and saw that the transcription didn't match the thing that it was saying."
David Guttman @latentspace · 2026-03-18 view on x
"One of my most maddening points in vibe coding would be when Claude Code compacts and loses its mind. One of the nice things is that OpenClaw, being the supervisor, is able to keep it going. One way it does this is that it doesn't give Claude the full plan: it uses clear context very liberally, inlines the task, and keeps it pretty scoped."
David Guttman @latentspace · 2026-03-18 view on x
"OpenClaw tries to do a checklist. I also added a step where, once the implementation is done, it sends it to Codex to be a persnickety bastard and call Claude Code out on its BS, which I think helps."
David Guttman @latentspace · 2026-03-18 view on x
"It's almost like duplicating the work by not letting Claude Code do its standard sub-agent plan breakdown. And the only reason I'm doing that is that I'm trying to give more control to OpenClaw, to be the supervisor and have responsibility for the different phases and the acceptance criteria, and to take that away from Claude."
David Guttman @latentspace · 2026-03-18 view on x
"Test-driven development — getting a test that really matters, that really encapsulates whether or not this feature is done for a user. Having OpenClaw go check and make sure that it fails, having Claude then build the thing, and then having OpenClaw check that it passes. I find it really hard to be consistent about that myself and not get cheated."
David Guttman @latentspace · 2026-03-18 view on x
"This is generally the workflow that I use now even when I am at a desktop — I'm doing it through Discord. It works that well."
David Guttman @latentspace · 2026-03-18 view on x
"I just took a Qwen 8-billion-parameter model and tried to train it on IIT JEE Advanced problems. I ran a benchmark with the base model, then with the SFT model, and then also with SDPO. SFT was still better than base Qwen, and SDPO showed a regression in a lot of areas."
Vipul Sehgal @latentspace · 2026-03-18 view on x
"What I also noticed was that the model was very verbose and was generating far more tokens than the base model. There were a lot of questions on which the model just kept generating tokens and hit the 1024 cutoff."
Vipul Sehgal @latentspace · 2026-03-18 view on x
"In order to get that feedback, I ran all the questions in my question bank through Claude Opus to get a full answer with the entire reasoning. I asked Opus to give chain-of-thought answers, so that the feedback covers not just whether an answer is right or wrong but also the entire chain of thought."
Vipul Sehgal @latentspace · 2026-03-18 view on x
"Sounds like if you have some kind of expert or teacher like Opus, then you are still better off using that. It is really when you are in a bind, trying to push the frontier with no expert to go to, that you can use this self-distilled thing to try to get a little bit of progress."
Swyx / Moderator @latentspace · 2026-03-18 view on x
"A lot of people are asking what happens after RLVR and I think the emerging consensus is rubrics. This is not really super surprising to anyone paying attention."
Swyx / Moderator @latentspace · 2026-03-18 view on x
"When you add clamping within an SAE you have no overhead; you can just do it in real time. If you want to do clamping with a diffusion model, you have to do 200 steps of diffusion per token generated. So it is not very real time, but it does give us a better, deeper look into what is happening in the model."
Vibhu Sapra @latentspace · 2026-03-18 view on x
"Dr. Tulu is rubrics for deep research. What deep research really is, is a super long rollout. They focus on search-guided rubrics, where you kind of have to figure out the rubrics as you go. Their framing is positive rubrics versus negative: positive rubrics capture a change you want to see, negative ones guard against reward hacking."
Swyx / Moderator @latentspace · 2026-03-18 view on x
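The positive/negative rubric split can be sketched as a simple scorer. This is a hedged illustration, not DR Tulu's actual reward format; the `(description, check)` pair interface and the equal-weight scoring are assumptions.

```python
def score_with_rubrics(response, positive, negative):
    """Score a response against positive and negative rubrics (sketch).

    positive -- list of (description, check_fn) pairs; each satisfied
                check adds credit for a change you want to see
    negative -- list of (description, check_fn) pairs; each triggered
                check subtracts credit, guarding against reward hacking
    """
    score = sum(1 for _, check in positive if check(response))
    score -= sum(1 for _, check in negative if check(response))
    return score
```

In a real rubric-reward setup the checks would themselves be LLM judgments with per-rubric weights, but the asymmetry is the point: positive rubrics pull toward desired behavior, negative ones penalize degenerate strategies that would otherwise max out the positive set.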
"Not a single paper in this survey addresses the issue of stochastic non-determinism. You run the judge once, you get one answer; you run it again, you get a different answer. Nobody has addressed that. Probably the answer is to run it three times and take the majority, but that is a pretty unsatisfying answer."
Swyx / Moderator @latentspace · 2026-03-18 view on x
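The "run it three times and take the majority" fix is simple to write down. A minimal sketch, where `judge` is a hypothetical callable returning a verdict label (e.g. "pass" / "fail"); returning the agreement rate alongside the verdict at least makes the residual disagreement visible.

```python
from collections import Counter

def majority_judge(judge, item, runs=3):
    """Run a stochastic LLM judge several times and take the majority.

    judge -- hypothetical callable: judge(item) -> verdict label
    runs  -- number of repeated judgments (odd avoids exact ties
             for binary verdicts)

    Returns (majority_label, agreement_rate).
    """
    verdicts = [judge(item) for _ in range(runs)]
    label, count = Counter(verdicts).most_common(1)[0]
    return label, count / runs
```

This reduces variance but does not remove it, which is exactly why the quote calls it unsatisfying: a 2-of-3 verdict on a borderline item is still a coin with a bent edge, not a measurement.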
"They are training a 3-billion-parameter model just to interpret a 1B Llama. Pretty crazy scale there. But the diffusion loss follows a smooth power law, and scaling directly affects downstream tasks: both steering performance and probing accuracy improve with compute, closely tracking the diffusion loss."
Vibhu Sapra @latentspace · 2026-03-18 view on x
"With rubrics, as opposed to other forms of preference rating or other sorts of fine-tuning, you do get an immediate jump, sometimes matching or exceeding what a human investment would yield, which is surprising for Scale to admit."
Swyx / Moderator @latentspace · 2026-03-18 view on x
"I heard that Codex 5.3 doesn't have the same issues with compaction. But one of my most maddening points in vibe coding would be when Claude Code compacts and loses its mind."
David Guttman @latentspace · 2026-03-18 view on x