All of AI's New Models and Tools
Channel: The AI Daily Brief: Artificial Intelligence News
Date: 2026-04-10
Duration: 15min
Views: 4,060
URL: https://www.youtube.com/watch?v=20vZc0cOpOw

Overview of major model and agent launches: Meta's Muse Spark multimodal model and personal-agent focus, Google's Gemini notebooks for shared task contexts, and open-source GLM 5.1 pushing coding benchmarks. Benchmark comparisons show GLM 5.1 and Muse leading on coding and visual reasoning, while Anthropic's Claude/Mythos faced a restricted rollout over cybersecurity concerns. New managed-agent stacks and agent harnesses promise rapid prototype-to-production flows and persistent-memory assistants.

Today, we are doing a whistle-stop tour of all of the new models, harnesses, and tools for AI and agents that we got to play with this week. Welcome back to the AI Daily Brief. One would be forgiven for thinking that this week has been defined by models that we actually didn't have access to. A huge part of the discourse throughout the week has of course been about Anthropic's Mythos, a model which Anthropic found too powerful to release in the normal way, and which right now is only in the hands of about 40 partners for some very limited cybersecurity-focused engagement. Then just this morning, as you heard in the headlines, we also heard that OpenAI planned its own staggered rollout of its new model for similar reasons: cybersecurity risks. Now, even among people who understand theoretically why these companies are doing this, there's still, I think, a bit of a sentiment of don't tell me about the new toys if I can't play with them. But luckily, the rest of the AI industry is not slouching at all. And in fact, even Anthropic themselves have given us something different that's still pretty powerful to play with.

So, let's talk through all of the other models and tools that have been released, starting with the first big model release from
the new Meta Superintelligence Lab. Muse Spark is Meta's first new model release in over a year. It's also the first model to come from the new Meta Superintelligence Labs division, which is of course the collection of superstar, crazily high-paid AI researchers that was put together last summer and brought together under the leadership of Alexandr Wang, who was brought in through the $14 billion-plus partial acquisition of his company Scale. Muse Spark will be the first of the Muse family of models, with Meta ditching the Llama name and its associated baggage. The Muse models are natively multimodal reasoning models, similar to Google's Gemini architecture. Meta noted that they support tool use, visual chain of thought, and multi-agent orchestration. Now, those features are at this point kind of table stakes for the current generation, but based on fairly low expectations, people were still encouraged to see them present here. Meta didn't indicate how large the model is or whether it uses a mixture-of-experts architecture. In fact, we don't really know at all where this model sits in the model family. Executives referred to it as small and fast, but its performance in comparison points looked closer to a mid-size or large model.

On the benchmarks, at first glance, Muse Spark looks pretty capable. It scored 52.4 on SWE-Bench Pro, for example, putting it within a few points of Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 for coding. On Humanity's Last Exam, it scored 42.8, which is slightly better than Opus, but trailing Gemini and GPT 5.4. Now, interestingly on that one, with tools enabled, Muse's score only jumped to 50.4, leaving it trailing all three of those major models by a few points. This could suggest the model isn't as good at web search or tool use as the others, but of course this is only a single data point. The general sense you get from the benchmarks is that Muse is in the mix, but certainly not leading the pack. And you can certainly tell where Meta is trying to put the emphasis. Rather than leading with their scores on Humanity's Last Exam or SWE-Bench, Meta buried those scores fairly deep in the results table, instead leading with the multimodal benchmarks where Muse Spark excels. The model scored 86.4 on CharViC's reasoning benchmark, a measure of visual comprehension, which would actually be a state-of-the-art result, beating Gemini
3.1 Pro by six points. Muse Spark did slightly trail Gemini on an assortment of other visual tests, but the results were strong enough to suggest the model will be highly capable. Now, these benchmarks also gel with how Meta views the model's purpose. Unlike the other model companies, where there is increasing focus on coding use cases and enterprise use cases more broadly, Muse Spark is designed primarily to drive personal agents. In a Threads post, Mark Zuckerberg wrote that Muse Spark is a world-class assistant and particularly strong in areas related to personal superintelligence like visual understanding, health, social content, shopping, games, and more. And interestingly, on that same note, while Zuckerberg is trying to draw a clear differentiation from the work-focused use cases the other companies are pursuing, there is still, broadly, even here in the personal realm, a shift from assistant AI to agentic AI. Zuckerberg ends his Threads post by saying, "We are building products that don't just answer your questions, but act as agents and do things for you." Giving more examples of where these capabilities will be useful, Meta wrote that they enable interactive experiences like creating fun mini-games or troubleshooting your home appliances
with dynamic annotations. The model will immediately go into service driving Meta AI and will presumably arrive across Meta's social media platforms over time. Muse Spark will function in three modes: instant, with no reasoning; thinking mode, which enables reasoning; and contemplating mode, which performs deep research-style multi-step reasoning. Contemplating mode, however, won't be available at launch. Meta also emphasized the health assistant use case, touting that they collaborated with a thousand physicians to curate training data for factual accuracy. Now, in this case there doesn't seem to be a separate interface for health; it's just functionality that's being encouraged on Meta's existing platforms. Meta AI leader Alexandr Wang argued that Muse Spark is just the beginning, posting, "This is step one. Bigger models are already in development with infrastructure scaling to match. Private API preview open to select partners today, with plans to open source future versions." One strand of the response that's been fairly consistent was basically, "Welcome back to the party, guys." To some, even though this model is clearly behind the other leaders, the fact that the Meta Superintelligence Lab was able to get it out less than a year after that lab was formed was a
feat in and of itself. Others were just less impressed. Ethan Mollick writes, "After playing with it a bit, Meta's Muse Spark thinking is fine so far, but really doesn't match the current big three models. It is also a bit weird, like some strange language and tone, a little loose with facts, etc." After giving a few examples, he concludes, "Anyhow, it's not bad, just not the vibe level that the benchmarks might indicate. And for a first re-entry into the frontier model space, given the engineering efficiencies they achieved, it feels like a solid attempt. I'm sure we will see better from Meta in the future." Arc Prize founder François Chollet was less forgiving. He wrote, "The new model from Meta is already looking like a disappointment, over-optimized for public benchmark numbers at the detriment of everything else. Knowing how to evaluate models in a way that correlates with actual usefulness is a core competency for AI labs, and any new lab is unlikely to be successful without first figuring that out." Wang actually decided to respond to that one, saying, "We're always open to feedback and welcome any perspective on weaknesses you've noticed in the model from using it. We're quite upfront that our model does not perform well on ARC-AGI-2, for example, and publish those
results for the community to understand. That might reflect some areas of improvement of the model that we could focus on in the future." In general, though, Wang reports, "We have been pleasantly surprised by users' feedback on the model in areas like visual coding, writing style, and reasoning queries." Voss on Twitter, who previously did work on Meta AI, said, "Meta's latest model, Muse Spark, is actually much better than I had expected. Is it benchmark maxed? Yes, 100%. But so is every other model. Is it a frontier leader in any single category? No. Is it better than I expected? Yes. I look forward to the eventual open source version. Feels like they're coming back to life. Never fade, Zuck."

Now, speaking of open source, another model that we got this week that got completely overshadowed by the Mythos announcement was Z.ai's GLM 5.1. And at least on the benchmarks, it's the first open source model to overtake leading Western models on coding benchmarks. The new frontier model, which like I said is called GLM 5.1, achieved a 58.4 on SWE-Bench Pro, beating GPT 5.4 and Opus 4.6, which scored 57.7 and 57.3,
respectively. Z.ai also provided a mixed benchmark that included Terminal Bench 2.0 and NL2 Repo as well, which had GLM 5.1 slightly behind the two US leaders, but ahead of Gemini 3.1 Pro. Still, if those benchmarks hold, it puts GLM 5.1 in the top echelon of frontier models with a clear separation from Qwen 3.6 Plus and Kimiko 2.5. And indeed, what most people are latching onto is the fact that this is a full open source release with commercial licensing. It's a gigantic 754 billion parameter model, so you're not going to be running it locally on a Mac Mini. Still, it gives developers the opportunity to build on top of current-generation state-of-the-art models for kind of the first time. We've been tracking the apparent shift in Chinese lab strategy away from open source recently, but this release suggests that leading Chinese labs are at least still somewhat willing to give away their best-performing models. In terms of performance, Z.ai provided a few impressive examples in agents and coding. They claim that GLM 5.1 spent eight hours autonomously building a Linux desktop, using a self-review loop to remove the need for human intervention. And this is kind of what they emphasized in their announcement post as well, calling the blog post GLM
5.1: Towards Long Horizon Tasks. Running a vector DB test, the model carried out the database optimization task with significant results: it ran over 600 iterations using more than 6,000 tool calls to deliver 6x the performance of a standard 50-turn session. Z.ai leader Lu wrote on X, "Agents could do about 20 steps by the end of last year. GLM 5.1 can do 1,700 right now. Autonomous work time may be the most important curve after scaling laws. GLM 5.1 will be the first point on that curve that the open source community can verify with their own hands." Now, of course, whenever a company reports their own benchmarks, it's always worth taking it with a grain of salt and waiting to see what the actual vibes are as people get their hands on it. But at least at first glance, the model looks like a big step up for Chinese AI. It was trained entirely on less powerful Huawei chips, again demonstrating that the Chinese hardware stack can produce some powerful results. Also, coming just two months after the release of Opus 4.6 and GPT 5.4, it suggests the US continues to be only months ahead of its Chinese rivals. Leet LLMs summed up the gap in the conversation on X, saying, "Everyone's
freaking out about Claude Mythos while Z.ai casually open sourced a model built for eight-hour autonomous execution."

Now, speaking of Claude and Anthropic, if you thought they were going to slow down for the sake of discussion around Mythos, think again. On Wednesday afternoon, the company announced Claude Managed Agents, which they're pitching as everything you need to build and deploy agents at scale. In their announcement tweet, which has been seen 16 million times, they write that Claude Managed Agents pairs an agent harness tuned for performance with production infrastructure, so you can go from prototype to launch in days. It seems like part of the goal with this is to close the capability gap that we've been following on the show as well. Anthropic's head of product for the Claude platform, Angela Jiang, argued to Wired that there is a "notable gap" between what Anthropic's models are capable of and what businesses are using them for. This tool is meant to close that gap. Here's how Wired describes it, which is actually one of the simpler explanations that I saw: managed agents will give developers an agent harness, a term for all the software infrastructure that wraps around an AI model to help it work agentically, or take actions on behalf of a user. In practice, a harness is made
up of software tools, a memory system, and other infrastructure. Agents made through Claude Managed Agents will also come with a built-in sandboxed environment in which the agent can spin up software projects in a secure setting. The product also allows developers to create agents that can run autonomously for hours in the cloud, monitor what other Claude agents are doing, and toggle permissions that allow agents to access certain tools. Caitlin Leslie, the head of engineering for the Claude platform, said, "When it comes to actually deploying and running agents at scale, this is a complex distributed systems engineering problem. A lot of customers we're talking to previously had a whole bunch of engineers whose job it would have been to build and run those systems at scale. Now that we are giving them that bit out of the box, they're able to have those same engineers be focused on the core competencies of their business and their product." One of the demos provided was in collaboration with Notion, with product manager Eric Lew showing how he can offload a string of client onboarding tasks to his customized Claude agent. The big point was that the agent was running natively in Notion with full access to everything it needed to complete the task. Rather than needing to spend days setting up permissions, validating workflows, and figuring out local hosting, Lew was able to drop the managed agent in using a
virtual session. The platform also allows companies like Notion to build their own agents on top of Claude and offer them externally, bringing agents to market more rapidly. Anthropic's Alex Albert writes, "Managed agents eliminates all the complexity of self-hosting an agent, but still allows a great degree of flexibility with setting up your harness, tools, skills, etc." Claude Code's Tariq writes, "Managed agents is the first agent in the Claude API that has the right mix of simplicity and complexity. Implementation details like how you manage a sandbox are abstracted, but you have a lot of control over the actual execution of the model." Anthropic's Lance Martin gave a bunch of examples of the characteristics of agents being built with managed agents. He writes, "Some of the common patterns I've noticed across examples in my own work: event-triggered, where a service triggers the managed agent to do a task. For example, a system flags a bug and a managed agent writes the patch and opens the PR. No human in the loop between flag and action. Scheduled, where the managed agent is scheduled to do a task. For example, I and many others use this platform for scheduled daily briefs, e.g. of X/Twitter or GitHub activity, what a team of agents
is working on, etc." He also talks about fire-and-forget tasks, with humans triggering the managed agent to do a task via Slack or Teams, and long-horizon tasks, like Andrej Karpathy's auto research idea. Now, it's early, but some of the first experiments seem to validate some of those patterns. Jared Orkin writes, "You no longer need an engineer to run an overnight marketing analysis. You need one sharp operator in an afternoon. Set the schedule, set the guardrails, and walk away. Anthropic runs the infrastructure; you pay per session hour." He points out the catch nobody's saying out loud, though: "Someone still has to tune the prompt every Friday and act on the brief by 9:00 a.m. Monday. That's a job. That's the job we staff. The agent writes the brief; the operator runs the day." Pawel Huron started working on something similar to what I was trying last night. He writes, "I built my first managed agent, surprised how easy it was. You describe what you want in plain English, and the platform generates a full agent config: model, system prompt, tools, MCP servers, permission policies, all in YAML you can edit. I asked for an email reader that needs my approval before acting." Now, one thing he also notes that is not
available yet, exactly, although it's something that they're working on, is persistent memory across sessions. That means that the types of tasks that managed agents are well-suited for right now are a little bit more transactional and discrete. For example, some of the agents that I've been experimenting with recently are basically persistent learners that help with AI strategy from within Slack, which effectively is sort of an agentic version of what we do at Superintelligent, but that persistence isn't exactly well-suited to the way that they built managed agents right now. Still, there are clearly going to be a ton of people building with these tools, and I think it's going to very quickly become a core part of the Claude and Claude Code ecosystem.

Lastly this week, an update that seems small at first, but which is a massive quality-of-life upgrade: Google has introduced what they're calling notebooks in Gemini. Up to now, the way you managed projects in Gemini was frankly a little weird and unintuitive. They had their Gems feature, which was sort of, but not exactly, a version of projects in the way that you would manage them in ChatGPT or Claude, but this new notebooks functionality is much more directly that, allowing users to organize,
collate a set of resources, documents, contacts, etc. for particular tasks. Users can also build out custom instruction sets for Gemini within their notebooks, allowing them to modify the model for each different project they have. Still, Josh Woodward from Google argues that this goes beyond normal project settings. He writes, "Most AI chatbots give you basic projects. Gemini just built you a second brain." He goes on to call notebooks some of the magic of NotebookLM directly integrated into the Gemini app. Basically, you can take the resource management that you're doing in NotebookLM and put it directly in the Gemini app. Writes Google, "Think of notebooks as personal knowledge bases shared across Google products, starting in Gemini." Now, one of the common critiques you will hear when it comes to Google is that even if people like their models, the product suite is so spread out across all the different surface areas that people interact with Google through that it can be confusing and even overwhelming. It makes sense, then, to see them start to consolidate, if not the surface area of the products, then the transportability of features across those different surface areas, so that
effectively any door you walk in gets you to the same room. This may not be a full model release, but I think when it comes to many Gemini users' day-to-day experience, this will be an even bigger improvement than if they had released Gemini 3.3. Now, for those of you who are interested in going a little bit deeper on Anthropic's managed agents, I think I'm going to do a main episode about harness engineering soon, where we'll dig deeper into that. For now, however, that's going to do it for today's AI Daily Brief. Appreciate you listening or watching, as always, and until next time, peace.