AI Explained
Two AI Models Set to “stir government urgency”, But Will This Challenge Undo Them?
Channel: AI Explained
Date: 2026-03-26
Duration: 16min
Views: 71,232
URL: https://www.youtube.com/watch?v=s4tptozUJ8Y

First look at exclusive reports about OpenAI's new Spud model, and the model Anthropic think will stir governments to urgency, all in the context of the newly launched ARC-AGI-3. What do the extreme difficulty of that benchmark and its quirky scoring metrics mean for AI in 2026?

https://assemblyai.com/aiexplained

Check out my fast-growing (!) app, free to use, and use code INSIDER15 for paid tiers: https://lmcouncil.ai

AI Insiders ($9!): https://www.patreon.com/AIExplained

Two exclusive reports indicate that there will be a qualitative leap in AI performance from each of the next AI models released by OpenAI and Anthropic. For OpenAI, this has meant shutting down the Sora app to spare computing resources for the new Spud model. And for Anthropic, makers of Claude, it has meant renewed interest from the Pentagon in reviving a deal to use Claude beyond the 6-month deadline recently set by the US government. But this video will also dive into a brand new benchmark sure to be the talk of 2026, Arc-AGI-3. I've read the paper in full, but the headline result is that humans get 100% while the best AI models currently get less than half a percent. That might or might not be news to Jensen Huang, CEO of Nvidia, who this week said that artificial general intelligence has already been achieved. Let's start though with OpenAI's erotica bot, because the news there is that that erotic chatbot is not coming out.

After presumably spending billions optimizing it for engagement, it has been shelved, apparently. According to the Financial Times, which is always my source for sexbot rumors, OpenAI needs the compute for Spud. It needs to drop its other side quests to focus on AGI deployment, rolling up everything it does into one super app. Jumping across to The Information, apparently even OpenAI employees had complained that Sora, with its viral AI videos, was still just a drag on the company's computing resources. In contrast, the Spud model is apparently very strong, according to Sam Altman. It will be ready in a few weeks, and it will really accelerate the economy. I know at this point some of you will be rolling your eyes going, "Oh, he would say that." But this article was a strange echo of one I had just read in Axios about Anthropic and their Claude series. Here's the key paragraph that I don't think many people have noticed.

About the new Claude series, Anthropic have warned US government officials that the next big advance will supercharge both offensive and defensive cyber capabilities. It might even stir government urgency to strike some kind of deal. If you're not familiar with the breakdown in the former deal that Anthropic had with the Pentagon, check out my recent series of videos. This article, by the way, from Axios further hints that the Pentagon might be rethinking that breach in the deal. Apparently, even after the most public part of the spat between Anthropic and the Department of War, the key negotiating official said that they're still very close to an agreement. One detail I thought some of you may be interested in is how Anthropic might get back into the government's good graces. One of the advisers to Dario Amodei, the CEO of Anthropic, is Brad Gerstner. He's the architect of Trump accounts, which provide $1,000 to every newborn whose parents enroll. The article goes on to speculate about what would happen if Anthropic agreed to part-fund these accounts. Almost like a very early, tentative step toward universal equity. Anyway, everything so far might get you fairly hyped about the imminence of a new tier of AI.

So what follows is hopefully going to add a bit of context to all of this. Because in the last 48 hours, we got Arc-AGI-3, the continuation of a benchmark series I've covered for years on this channel. For the makers of the benchmark, as long as there is a gap between AI and human learning, we do not have AGI. I'm going to get to the paper in a moment, but my immediate response to that headline is that there is at least some gap between humans and, say, chimps on mathematical memory and speed. Chimps can actually track numbers that flash briefly on a screen better than humans can. So by that logic, humans aren't AGI either, a finding possibly substantiated by global events. But that aside, the Arc-AGI-3 puzzles are genuinely really fun to try. Maybe I'm sad. Maybe I should say slightly fun to try. And I love that they manage to simultaneously test exploration, planning, memory, and goal setting.

For example, nowhere on screen, for humans or for models, does it say that you need to move the icon around in order to manipulate the environment, or that the plus symbol, for example, will rotate the shape, the one in the bottom-left corner, or, even more importantly, that the goal is to make the bottom-left shape resemble the shape up here. None of those goals are stated, but like in real life, sometimes goals have to be inferred or self-produced. I did get some exclusive insight into Arc-AGI-3, but the bigger point is this: when so many benchmarks are narrow, to have a benchmark that does not rely on language, memorized knowledge, or cultural cues, one that is indeed abstract (the A in Arc stands for abstraction), is, for me, healthy for the field. But on to the details: what does the terrible performance of current frontier models tell us about how 2026 will unfold? Here are the highlights from the 21-page paper. First, what happened to Arc-AGI-1 and 2, which were saturated fairly recently by frontier AI models? They give a lovely graph in the paper showing the rapid improvement in the last 18 months on those benchmarks.

If you're not familiar, instead of an interactive game, these were more static tests of pattern recognition on grids. Well, the authors make two big points. First, the inbuilt chain-of-thought reasoning that publicly debuted with o1-preview back in September 2024 genuinely allowed models to demonstrate a kind of fluid intelligence: think on the fly and combine patterns from their training data to reach an end goal. That's part of the explanation for the saturation of those previous benchmarks. The other part of the explanation is more intriguing. The authors say that because the public set and the private test set of those benchmarks were quite similar, any model trained on an enormous number of tasks representing a dense sampling of the task space, automatically generated for this purpose, in other words, thousands of different guesses at what the private test set might be like, could essentially game the benchmark. It's not direct memorization; it's a higher-level shortcut, a form of attack.

They pointed out that models like Gemini 3, in their chain of thought, had giveaways that their training data may have resembled Arc-AGI-like tasks, either incidentally or intentionally. Going forward, the authors argue, private test sets need to be quite distinct and out of distribution compared to publicly available demonstration data. For Arc-AGI-3, the public test set is different from, and easier than, the semi-private test set that is evaluated via API and the fully private test set that is used for the competition. It's a different distribution of tasks, far less gameable, even if AI labs are intentionally trying to mix Arc-AGI tasks into their training data. The goal of Arc-AGI-3 is to measure the residual gap between frontier AI and human-level AGI. When those potentially crazy new models come out in the coming days or weeks, what remaining gaps, or deficiencies, will they have compared to humans?

And I would rather the authors of this paper say deficiencies than gaps, because in the methodology of the paper, we learn that AI performance is clamped at the human-derived baseline of 100%. So even if one day models solve these interactive games more efficiently than humans, they'll only ever score 100%. In other words, AI getting 100% on this benchmark will not be taken as proof of AGI, or even strong evidence of it, because the most that models can get is 100%. And yet current performance is taken as evidence of them not being AGI. The benchmark is also turn-based, so the superior speed of AI models, or their better reflexes, is not counted in the test. Nor does the relative cheapness of models count for much in the scoring here, because scores on Arc-AGI-3 are not based on how many levels you solve, but on how many actions you took to solve those levels. Also, if a model takes more than five times the number of actions the human baseline took, that attempt is scrapped, apparently because of API costs. And if you try the benchmark yourself, you'll see that the levels get progressively harder.

More importantly, what you learn from level one becomes applicable for level two and beyond. Learning in level one that the plus symbol rotates the shape is useful for level two. So again, the benchmark is also testing memory, which brings me to a couple of small paragraphs on page 19 that I found fascinating. A group called Symbolica AI created a harness, which essentially involved one model controlling another. The sub-agents would produce summaries of what was going on, and the paper notes that this design constrains the context growth that was otherwise destroying model performance. Instead of getting overwhelmed with all the grids being sent, the sub-agent was giving little textual summaries that allowed the orchestrator agent to maintain a higher-level plan. That approach was able to solve all three public environments. One issue, however, if you're prepping your local agent to tackle Arc-AGI-3, is that harnesses are not allowed. The purpose, they say, is not to measure the amount of human intelligence that went into designing an Arc-AGI-3-specific system.
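
To make that orchestrator-plus-summarizer design concrete, here is a minimal sketch of the pattern the paper describes. To be clear, this is my own illustrative reconstruction, not Symbolica AI's actual code and not the real ARC-AGI-3 API: the llm and game objects, and every function name here, are hypothetical stand-ins.

```python
# Illustrative sketch only: `llm` and `game` are hypothetical stand-ins
# (assumed to expose .complete(), .observe(), and .step()), not Symbolica
# AI's harness or the real ARC-AGI-3 API.

def parse_plan_and_action(text: str) -> tuple[str, str]:
    # Naive convention: the model writes its updated plan, then the
    # exact action on the final line of its reply.
    lines = text.strip().splitlines()
    return "\n".join(lines[:-1]), lines[-1]

def summarize(llm, grid) -> str:
    # Sub-agent: compress a raw grid into a short textual summary, so the
    # orchestrator's context holds summaries rather than full grid dumps.
    return llm.complete(f"Describe this game grid in two sentences:\n{grid}")

def play_with_harness(llm, game, max_actions: int) -> bool:
    plan = "No plan yet."
    summaries: list[str] = []  # compact history: this is what tames context growth
    for _ in range(max_actions):
        summaries.append(summarize(llm, game.observe()))
        reply = llm.complete(
            "You are controlling a game-playing agent.\n"
            f"Current plan: {plan}\n"
            f"Recent observations: {summaries[-5:]}\n"
            "Update your plan, then reply with the exact action to take."
        )
        plan, action = parse_plan_and_action(reply)
        if game.step(action).won:
            return True
    return False  # out of actions without winning
```

The design choice worth noticing is that the orchestrator never sees a raw grid, only the sub-agent's summaries plus its own running plan, which is exactly the context-growth constraint the paper credits for the improved performance.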

Therefore, they will focus on reporting the performance of systems that have not been specially prepared for the benchmark, served behind a general-purpose API. You can see the only context that models get: "You are playing a game. Your goal is to win. Reply with the exact action you want to take." Notice there's not even a reminder to win the game in the fewest number of actions. I was actually pretty shocked that Gemini 3.1, for example, was able to score 0.37%, although perhaps I shouldn't have been, because as Tim Rocktäschel of Google DeepMind pointed out, Arc-AGI-3 is not the only unsaturated agentic intelligence benchmark in the world. NetHack, a benchmark he was an author on, has been unsaturated, he says, for six years. Indeed, if you read the NetHack paper, there are some eerie similarities in the game design of these puzzles. On NetHack, Gemini 3 Pro, by the way, is the best-performing model at 6.8%. Back to Arc-AGI-3, because one of the best things about that benchmark is that every single challenge has been shown to be beatable by a human with no prior task-specific training.
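
For contrast with the harness sketch above, here is what a bare, general-purpose loop of the kind the authors say they will report on could look like: just that quoted three-line prompt plus the entire raw history accumulating in context. Again, a hedged sketch with the same assumed llm and game interfaces, not the real ARC-AGI-3 API.

```python
# Sketch of an unassisted, general-purpose agent loop: no harness, no
# summarization, just the quoted prompt plus the full raw history.
# `llm` and `game` are the same hypothetical stand-ins as above.

PROMPT = (
    "You are playing a game. Your goal is to win. "
    "Reply with the exact action you want to take."
)

def play_unassisted(llm, game, max_actions: int) -> bool:
    transcript = ""  # every raw grid, verbatim: the context growth the harness avoided
    for _ in range(max_actions):
        transcript += f"\n{game.observe()}"
        action = llm.complete(f"{PROMPT}\n{transcript}").strip()
        if game.step(action).won:
            return True
    return False
```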

I'm not sure that quite meets the easy-for-humans standard, though, because each environment, with multiple levels within it, was tried by 10 humans, and it was the second-best human performance that was counted as the human baseline, as the 100% level in terms of action efficiency. And there's another quirk worth noting, which is that inefficiency is quadratically penalized. In other words, if you took 100 actions to complete a level versus 10 for the human (which, by the way, wouldn't be allowed, because attempts are capped at five times the baseline, so you'd be stopped after 50, but let's pretend it was allowed), that 10% efficiency would be squared to give you a 1% score. Now, my slight issue with that is, if you note the human baseline in green, remember that's derived from the second-best human performance, you can see it's around 540 actions for this particular level, but the best human playthrough, always on the first run, by the way, which keeps it fair, is around 390. So, if we used the inefficiency scoring rubric that the benchmark does, even the second-best human out of the 10 would only score around 50%.
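
Pulling those rules together (the 100% clamp, the 5x action cap, and the quadratic penalty), here is a small sketch of the per-level scoring as I understand it from the description above. The exact formula in the Arc-AGI-3 paper may differ in detail; treat this as an assumption.

```python
# Hedged reconstruction of the per-level scoring rules as described here:
# action efficiency against the human baseline, quadratically penalized,
# clamped at 100%, with attempts scrapped beyond 5x the baseline count.
# The paper's exact formula may differ; this follows the video's description.

def level_score(agent_actions: int, human_baseline: int) -> float:
    if agent_actions > 5 * human_baseline:
        return 0.0  # attempt is scrapped at the 5x action cap
    efficiency = human_baseline / agent_actions
    return min(1.0, efficiency ** 2)  # quadratic penalty, clamped at 100%

# The worked example: 100 actions against a 10-action baseline would be
# (10/100)^2 = 1%, were the attempt not stopped at the 5x cap first.
assert level_score(100, 10) == 0.0
# The second-best human (~540 actions) scored against the best run (~390):
print(f"{level_score(540, 390):.0%}")  # ~52%, i.e. "around 50%"
```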

The headline for me is that Arc-AGI-3 is a brilliant, creative, but pretty adversarial benchmark. It really would take a step change in AI efficiency and intelligence to score even beyond 50%. I suspect at some point the Arc Foundation may want to report the median human performance, not just the second-best one, and what AI performance would be without the 5x capping or the quadratic penalties. With that caveat aside, I still think Arc-AGI-3 might be one of the most creative benchmarks ever created for AI. And that does seem like a great time to mention a field in which performance is definitely not low, and that is speech recognition. Because I wonder if you saw that the sponsors of today's video, AssemblyAI, have produced Universal 3 Pro Streaming. That's a speech-to-text model for, appropriately enough, agentic streaming. To put it more simply, it can handle, in real time, things like credit card numbers, email addresses, or those rarer words.

Now, we could look through even more stats about Universal 3 Pro Streaming, but I could also just point you to my custom link in the description, through which you can test the model live today on your own voice. And on that note, there are two more stories I want to cover in the remaining minutes of this video. Stories that help sum up where we are in AI as we head into spring. The first comes from MIT Tech Review, and that's that OpenAI is throwing everything into building a fully automated AI researcher, one that will be able to go off and tackle large, complex problems by itself. Moreover, this new research goal will be OpenAI's North Star, apparently, for the next few years. OpenAI want to become automated AI, and they even want an intern-level AI, one that can take on a small number of specific research problems by itself, by September. The idea is that AI research would then become like software engineering, with AI doing the grunt work and humans just reviewing. Indeed, one of the leaders at OpenAI makes that exact analogy: nobody really edits code all the time anymore; instead, you manage a group of Codex agents.

I do want to note something, though: even when OpenAI becomes automated AI, that does not automatically mean we'll get immediate exponential takeoff or takeover by the AI models. As OpenAI's GDP paper points out, even when it first became quicker to allow the models to draft economically valuable tasks, which humans would then edit, the speedups for those humans have historically been in the 40% range. In other words, even when that switchover occurs, where it's AI doing the research and humans just reviewing the outputs, it doesn't mean that the next day we have superintelligence. Indeed, we have very tentative evidence that even when that flip occurs, AI drafting first and humans editing afterwards, rather than vice versa, we are still seeing a rise in openings, even in engineering jobs at tech firms. You can see an increase of over 50% in openings in the last 3 years in engineering roles at tech companies globally, from under 40,000 to 67,000 now.

This could be a lagging indicator, but even Anthropic and OpenAI are still hiring prolifically. I would add that the brutal week I've had, seeing Claude Opus 4.6 and GPT-5.4 Extra High repeatedly screw up engineering tasks, has served for me as a daily reminder of this fact. A flip to AI-first isn't automatically an exponential speedup. Though I would concede they did nudge me towards Framer Motion, so my rapidly improving lmcouncil.ai tool, where for free you can consult a range of AI models, does now have an even cooler feel to it. Look at that pop. And last, we were reminded in the last 48 hours of the risks of allowing complete agency to OpenClaw-like swarms of AI models, when a seemingly vibe-coded hack allowed a key open-source Python library to be co-opted, essentially, such that updating it, or not catching your agent swarm doing so, would export all your secrets and keys to the dark web. Of course, many are listening to this thinking, "Well, we're going to develop agentic claws that will be instrumental in looking out for such exploits."

But while we all try to fight fire with fire, human oversight and review seem sorely needed. As Nvidia's distinguished scientist Jim Fan notes, "Claws need shells, probably many layers of nested shells." Especially so when, as I covered on Patreon, which is also where I interviewed Jim Fan a couple of years back, we now have models smart enough and unpredictable enough to hack the benchmarks they are being tested on in real time. So, we are very much in the messy middle phase of AI. It's a better first drafter than us, but its outputs are still often full of holes. There is stark evidence of clear generalization of lower-level topics across coding languages and human languages, but less so of higher-level topics, such as academic integrity, as we just saw, or adaptive goal-setting, as Arc-AGI-3 shows. At the moment, we're in that messy middle, and that makes the coming year a very interesting one indeed.

Thank you so much for watching, and have a wonderful day.