10 quotes from AI researchers about benchmarks, models, and evaluation
"Base models were scoring extremely low on ARC V1, like sub-10% basically... performance of base LLMs on V1 stayed very, very low even though in the meantime we had scaled up these models by 50,000x."
"The moment models started performing well on ARC1 was with the first reasoning models, in particular the OpenAI o1 and then o3 models, which, by the way, were demonstrated by OpenAI on ARC because it was the one unsaturated reasoning benchmark that was really showing that this model was different, that it had new capabilities."
"What you can do to solve ARC2 is you ask your reasoning model to make more tasks like those in the benchmark, then you try to solve them, you verify the solutions, you fine-tune the model on the successful reasoning chains, and you can keep doing this millions of times. The new paradigm in AI is basically that any domain where you have verifiable reward signals, you can run this kind of loop and brute force mine the entire space."
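The loop described in this quote (generate tasks, attempt them, verify, keep the successful reasoning chains for fine-tuning) can be sketched in a few lines. Everything here is a hypothetical stand-in: `generate_task`, `solve`, and the doubling task are illustrative placeholders, not any real model API.

```python
import random

# Hypothetical sketch of the verifiable-reward loop described above.
# The "task" is trivially checkable (double a number) so the reward
# signal can be verified exactly; a real setup would call a model.

def generate_task(rng):
    # Stand-in for the model proposing a new benchmark-like task:
    # returns (input, expected_output).
    x = rng.randint(0, 100)
    return x, 2 * x

def solve(x, rng):
    # Stand-in for the model producing a reasoning chain and an answer;
    # it is deliberately unreliable so verification matters.
    answer = 2 * x if rng.random() < 0.5 else 2 * x + 1
    chain = f"double {x} -> {answer}"
    return chain, answer

def mine_successful_chains(n_iters, seed=0):
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_iters):
        x, expected = generate_task(rng)
        chain, answer = solve(x, rng)
        if answer == expected:        # verifiable reward signal
            dataset.append(chain)     # keep only verified chains
    return dataset                    # fine-tune the model on these

chains = mine_successful_chains(1000)
```

The key property is that the filter is a hard verifier, not a learned judge: only chains whose answers check out exactly survive into the fine-tuning set, which is what lets the loop run "millions of times" without drifting.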
"When it comes to competency, there is always a trade-off between intelligence and knowledge. If you have more knowledge, if you have better training, you need less intelligence to be competent. The models do not have higher fluid intelligence per se. It is just that they are way better trained."
"The fact that you need humans to engineer these harnesses is also a sign that we are short of AGI today, because if we had AGI, AI would just make its own harness, it would not need to be told how to solve a problem. Harnesses do not get us closer to AGI in any sense, but it is a very valuable area of research because that can lead to task automation at scale."
"ARC V3 is completely different. We are trying to measure agentic intelligence. It is interactive, it is active. The data is not provided to you, you must go get it. Your agent is dropped into a new environment and it is not provided any instructions, it is not told what to do, it is not told what the goal even is or what the controls even are, and it must figure out everything on its own via trial and error."
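The interactive setting described here (no instructions, unknown goal, unknown controls) reduces to a probe-and-observe loop. A minimal sketch, assuming a toy environment with a hidden goal action; `HiddenEnv` and `explore` are illustrative inventions, not the ARC V3 interface.

```python
import random

class HiddenEnv:
    """Toy environment: reward 1 only when the secret action is taken.
    The agent is told neither the goal nor what the actions do."""
    def __init__(self, n_actions=8, seed=0):
        self.n_actions = n_actions
        self._goal = random.Random(seed).randrange(n_actions)

    def step(self, action):
        # The only feedback the agent ever gets.
        return 1 if action == self._goal else 0

def explore(env):
    # Trial and error: systematically probe each unknown control
    # and record what happens, stopping when feedback is positive.
    tried = {}
    for action in range(env.n_actions):
        reward = env.step(action)
        tried[action] = reward
        if reward == 1:           # goal discovered by experimentation
            return action, tried
    return None, tried

env = HiddenEnv()
found, history = explore(env)
```

Even this toy version shows the shape of the problem: all knowledge of the environment comes from the agent's own interaction history, which is what makes it a test of fluid intelligence rather than recall.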
"ARC V3 is designed to be more resistant to the kind of harness strategies we saw for V2. We have deliberately tried to create a private set of environments that is significantly different from the public set. The public set is meant to be substantially easier, so performance on the public set is not representative of how well the system would perform. That makes it a better test of fluid intelligence as opposed to a test of how much effort you put into cracking it."
"I do believe that when we create AGI, retrospectively it will turn out that it is a code base that is less than 10,000 lines of code and that if you had known about it back in the 1980s you could have done AGI back then using the compute resources available back then."
"If you just try to extrapolate from the current rate of progress and the amount of investment that is going into not just the LLM stack but also side ideas, side bets that might work out, I think we are probably looking at AGI in 2030 or the early 2030s, most likely."
"ARC2 signaled that there was this new set of capabilities emerging. The benchmark did a really good job at capturing the advent of reasoning models and then the advance of agentic coding, this new paradigm where if you have verifiable rewards then you can basically fully automate the domain."