16 quotes from AI researchers about benchmarks, models, and evaluation
"I remember trying frontier models on some of those [ARC-AGI] puzzles, getting strange results, and then working my way back to, 'Can you just describe what the starting state is?' And then I was like, oh, well, no wonder it cannot do the problems. It cannot see the starting state accurately."
"One of them is grounding in particular, grounding referring to segmentation and detection, the traditional tasks. But if you want to, say, in your example, find the starting position in ARC-AGI, the segmentation portion of that chain of thought is pretty unsolved, because there are so many different things that you would want to measure, see, and have a pixel-perfect representation of."
"We maintain playground.roboflow.com, where you can do SAM 3 versus Gemini versus Claude Opus, and what is really funny to me is that I will find these failure cases and report them to our team, and then they do not reproduce. And it is actually not because of our use of the models; the model itself does not reproduce the same way."
"We introduced a benchmark at NeurIPS called RF100VL, Roboflow 100 Vision Language. We evaluated Gemini, SAM 3, OpenAI's models, and a number of multimodal LLMs. The best model at the time we published the work was Gemini 2, but that scored 12.5% across all domains. The gap of how far these models have to go on segmentation is enormous."
"The zero-shot performance was 12.5%. We ran a competition at CVPR on a 20-dataset subset: if you had few-shot, that is, one through five image examples, how much do you see the models improve? The lift there, I think, was maximally around 10% for a single model. Meaningful, especially when you are starting at 12%, but not a panacea."
"In visual AI in particular, the US has almost never led, whereas in language we have consistently been ahead, in closed models and open models alike. The importance of manufacturing, the importance of vision in manufacturing, and the importance of manufacturing in the Chinese economy: these are all trends that tell you why focusing on visual understanding is probably a high priority for them."
"RF-DETR is the first real-time instance segmentation transformer, as well as the fastest and most accurate for pixel-wise segmentation and detection. At the 2XL size, if you do a fine-tune, it is more accurate than a fine-tuned SAM 3, and 40X faster."
"Heuristically, I see maybe an 18-month delay between a SOTA capability in a multimodal cloud-available model and something that you can get to run on an edge device, which here we could define as maybe a Jetson Orin-level computer."
"SAM 3 is the best open-vocabulary model globally, and Meta are the publishers of it. One thing that people dunk on Meta about is their lack of language models, and they again under-credit how good Meta has consistently been at visual AI in particular, and at consistently advancing computer vision."
"There is a model, GLM, released recently that we are really excited about. It can run in real time; you can query it with, 'Hey, how much was my salsa on this receipt?' or, 'From this Google Street View image, what is the house address on the left?' and it is able to visually reason, extract, and pull the correct answer almost always. Something like that feels closer to a solved problem."
"What we did with weight sharing in neural architecture search is, rather than train a separate model for every accuracy-latency configuration, we use weight sharing in NAS to train thousands of subnetwork configurations in parallel within a single training run. That was a huge freaking unlock for us to be able to use our compute budget efficiently."
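The mechanism behind this quote can be illustrated with a toy sketch. The following is not the speaker's actual system; it is a minimal, hypothetical NumPy example of the general weight-sharing idea: one oversized "supernet" weight matrix is trained once, and every width configuration is just a slice of it, so many subnetworks share parameters instead of each getting its own training run.

```python
import numpy as np

# Toy sketch of weight-sharing NAS ("supernet" training), assuming a
# single linear+ReLU layer whose width is the searched dimension.
rng = np.random.default_rng(0)

MAX_WIDTH = 64
W = rng.normal(scale=0.1, size=(MAX_WIDTH, 8))  # shared supernet weights
b = np.zeros(MAX_WIDTH)                          # shared bias

def subnet_forward(x, width):
    """Run the subnetwork that uses only the first `width` units."""
    return np.maximum(0.0, x @ W[:width].T + b[:width])

def train_step(x, y, width, lr=0.01):
    """One SGD step (MSE loss) on a sampled subnetwork.
    Gradients only touch the slice of shared weights this subnet uses."""
    h = x @ W[:width].T + b[:width]       # pre-activation
    out = np.maximum(0.0, h)              # ReLU
    err = (out - y[:, :width]) * (h > 0)  # dLoss/dh
    W[:width] -= lr * err.T @ x / len(x)
    b[:width] -= lr * err.mean(axis=0)

x = rng.normal(size=(32, 8))
y = rng.normal(size=(32, MAX_WIDTH))
for step in range(100):
    # Sample a random width each step: all configurations train in
    # parallel through the same shared parameters.
    width = int(rng.integers(8, MAX_WIDTH + 1))
    train_step(x, y, width)

# After one training run, any width in [8, 64] is a usable subnetwork:
small = subnet_forward(x, 8)    # low-latency configuration
large = subnet_forward(x, 64)   # high-accuracy configuration
print(small.shape, large.shape)  # (32, 8) (32, 64)
```

In a real system the search space covers depth, kernel sizes, and resolutions rather than a single width knob, but the payoff is the same: one training budget yields the whole accuracy-latency frontier.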
"If Meta were to stop publishing open source tomorrow, or if Nvidia were to, all of open source would take a hit, because a lot of improvements come from taking the best ideas, experimenting, running ablations, and smashing them together. That tells the story of what is going on in open-source vision."
"The types of problems that models can recursively improve against are the ones you can benchmark. And the second you can benchmark it, you can scale a bunch of compute and the bitter lesson takes hold. Aesthetics is maybe a little bit in the eye of the beholder, what is good, what is bad; it lives outside the range where you can toss compute at it and get better results."
"I think we are approaching the ChatGPT moment for vision, and the infrastructure to power all of that is coming online, which means you are about to see a Cambrian explosion in all the places it gets applied. And consumer expectations are just going to be disappointed absent the ability for folks to have visual understanding in the products and services we use day to day."
"The post-training that is applied to these problems at a lot of the labs (it is a little bit hearsay, but seems to be increasingly common knowledge) is not as interested in just solving the segmentation problem. They are interested in solving for what the user intent was, with segmentation as a tool call in service of that intent."
"RF-DETR retook state of the art for the US in a very specific set of important tasks: real-time object detection and real-time instance segmentation. Before that, you had models like LW-DETR and the RT-DETR family, which are great work published out of labs in China."