In this video, we look at the latest release from Meta AI, Muse Spark, which seems to be their internal replacement for the Llama models.
Blog: https://ai.meta.com/blog/introducing-muse-spark-msl/
Demo: https://meta.ai/
Twitter: https://x.com/Sam_Witteveen
🕵️ Interested in building LLM Agents? Fill out the form below
Building LLM Agents Form: https://drp.li/dIMes
👨‍💻 GitHub:
https://github.com/samwit/llm-tutorials
⏱️Time Stamps:
00:00 Intro
00:14 Meta AI: Muse Spark Blog
00:30 Llama 4
01:42
Okay, so in this video I'm going to cover the new model from Meta. It's called Muse Spark, although it sounds like this is probably the model that everyone's been referring to as "avocado" over the past couple of months. And this release is confusing on a number of different fronts, which I'm going to go through, looking not just at what's bad about the model (and clearly the model is lacking compared to what people were hoping would come out), but also at where Meta could actually be going with this kind of model. Now, as of this week it is literally one year since Llama 4 got released. And Llama 4 was such a big letdown. It came out on a Saturday afternoon, San Francisco time, and honestly, at the start I thought it was a leak or something like that, not an official release. But sure enough, there was this website announcing it. Now, back then they only made the smaller versions of these models available, and they promised that the Behemoth model was in preview and would come in the future. Well, that
never came, certainly not to open source, and from what I understand very few people got access to actually use it and try it out. And I've got to laugh, because as I came in to basically just check when it was released, this was the top search result for Llama 4: "Unmatched performance and efficiency," which I thought was quite funny. But as I clicked through, sure enough, some of that content seems to have been taken offline. Now, jump forward about two or three months from that Llama 4 release. Not only had it become obvious that Llama 4 was a big fail, but Mark Zuckerberg went on a spending spree to basically build Meta superintelligence. They spent, I think it was about $14 billion, to pull Alexandr Wang into Meta, putting him in charge of this brand new division called Meta Superintelligence Labs. And then he went on a hiring spree where some of the pay packages were reportedly hitting hundreds of millions of dollars if you count the equity, etc. Well,
yesterday that team finally shipped the first LLM that they've been working on for the last nine months. It's called Muse Spark. And honestly, the story here really isn't the one that Meta is trying to sell everybody. The real story is about what money can and can't buy in AI right now. So let's jump in and take a look at it. Okay, so a quick bit of history. I covered all of the Llama models; this channel literally started because of Llama 1 and me explaining it to people and getting asked about it so often that I made a video about it, which became the early version of this channel. Now, I was a big fan of Llama 1, Llama 2, and Llama 3, but Llama 4 was clearly a big thud that summer. And Zuckerberg basically decided to blow up the entire Meta AI org and start over. Like I said before, he paid $14 billion for a stake in Scale AI, which brought in Alexandr Wang. And then they went on
this spending spree to basically try and catch up to OpenAI, to Google, to Anthropic. The goal was to poach the best research talent they could get and bring them over. And some of those people did leave for Meta: we saw some big exits from OpenAI, and we saw a couple of interesting exits from DeepMind. And you've got to understand that this is something Zuckerberg was personally involved in. Literally, one of the researchers that I know, who's one of the top people in reasoning, talked about how Zuckerberg spent a long time on the phone trying to convince him to leave the lab he was at to come and head up reasoning at Meta AI. So, jump forward to today: we've got this model out and we can finally see where it stands. Now, not only did they poach people, but they've also had quite a number of defections, where people who did join that team early on felt that it wasn't going right
and actually up and left. And I find it kind of interesting that for some of those people who've left, it sounds like it was such a traumatic experience that they've removed all mention of ever being there from their LinkedIn. So of course, all of us have been waiting to see, okay, what actually got released and what was the quality of this release. First up, and most important here I think, is that this is a proprietary model. This is not an open model. Now, Alexandr Wang does kind of hint that bigger models are already in development, that things are going to get better (exactly what they said with Llama 4), and that they plan to open source future versions of this. But currently there isn't even an API that we can use. And I think that underlines the key thing here: this model is not built to be an open model. It is built for Meta to use. And we know that along with Meta Superintelligence going out and buying up researchers and expanding their team, they also bought Manas. And it seems in hindsight that
they paid another one to two billion dollars for this startup, which at the time of acquisition was probably the best example of a multi-user agent system that was doing really well. Now, if you don't know the difference between single-user and multi-user agent systems, I did a whole video explaining that; check it out. But this acquisition clearly signaled that Meta thought part of the future was that Facebook users, etc. were going to want their own agents. And they wanted to get in on this both through the harnesses and the actual system of building an agent, by acquiring Manas, and through developing their own models, the first of which we're seeing here with Muse Spark. Okay, now let me stress: this is not a bad model. Had this model come out late last year, everyone would be raving about it. And in many ways this model is certainly going in the direction that Meta actually wants for general purpose models to use on their platforms, whether
that's WhatsApp, whether that's Facebook, etc., and for having their own multi-user agent system that they can run at scale. Now, where the model falls down is that it's coming out after so many other good models, especially some of the open models that we've seen recently come out and be extremely strong. And as this is basically a publicity release, we don't know anything about the size of the model, its training data, or how many tokens it's been pre-trained on, etc. Unfortunately, this is becoming more and more common, even with open models. We're not getting the full recipes that we were getting from open models a year or two ago, where people would tell you straight up how many tokens the model was trained on and roughly what the data mix looked like. None of that is here. What is here is talk about how this model is built for personal superintelligence.
And personal superintelligence here seems to mean both replacing something like ChatGPT and rolling out a personal agent system, which users will be able to use this model with via the kind of harness that Manas has brought to the table. Now, in their release they've got their benchmarks. For me, what's actually way more interesting is that a user on Twitter posted the same benchmark data but color coded, so you can see which model was the best and which was the worst on each benchmark. And I should point out this model actually comes in two modes, an instant mode and a thinking mode, and supposedly there's a deep-think mode, rumored to be called contemplating mode, on the way. Sure enough, you can see here that when we're looking at pure intelligence, not only is Muse Spark Thinking winning on only three of the benchmarks, it's also actually last on three of
the benchmarks, including Humanity's Last Exam with tools, which is the benchmark that Alexandr Wang's own company, Scale AI, actually created. Now, that said, I don't want to beat them up about the benchmarks. I actually think it's really honest of them to publish this. And we can see that the majority of the time they're not the best, they're not the worst, but they are competitive. You can see that with some of the multimodal reasoning results here. But interestingly, look at some of the third party analysis: Artificial Analysis does their own benchmarks, and they point out that this is sitting in the top five models they've benchmarked. It sits ahead of Claude Sonnet, ahead of GLM 5.1 and MiniMax 2.7, and really only behind the models from the top three proprietary labs: Google's Gemini, OpenAI's GPT 5.4, and Anthropic's Claude Opus 4.6. They also point out that the model is actually quite token efficient for its intelligence
level, and that it is, in their words, the second most capable vision model they've benchmarked. Unfortunately for Meta, or even the team from Manas who've joined Meta, it doesn't look like the model stands out on agentic performance. So I'm going to say I certainly don't think this model is the fail that Llama 4 was. The big question, though, is: is this ever going to become open, or is this literally just going to be the Facebook and WhatsApp model? And if it were to become open, would we actually get the base model? It is interesting to see in their press release that they talk about how over the last nine months they rebuilt their pre-training stack with improvements to model architecture, optimization, and data curation. Now, that could be a huge deal. If they can now train a very high quality base model for a lot less compute than before, this allows them
to keep iterating and get into that velocity of research that the big labs have, where they're able to constantly push out new models, try different things, and learn from what works and what doesn't. Along with pre-training, they've obviously gone heavily into scaling up the reinforcement learning side, which really is what's driving a lot of the improvements both from the proprietary model providers in the Bay Area and from some of the Chinese companies like Kimi, GLM, MiniMax, etc. Another interesting thing in here is that they're definitely pushing the whole safety aspect. Now, I don't usually cover a lot of the safety stuff, because I find that all these companies are just paying lip service to it, but I guess it does make sense that Meta, or rather Facebook, would want a model that is not going to cause them serious reputational harm by helping people create a weapon or cause self-harm, etc. Okay, so first up, when
you come in here, you'll see that Meta's clearly making a play to get you to connect other context to their app. Just like with Claude, ChatGPT, or Gemini, you can connect different things: here you can connect Google Calendar or Outlook for your calendar, and Gmail or Outlook for your mail. This is very similar to your ChatGPT or Gemini setups, except here we can pick either the instant model or the thinking model. They also have a very cool way of letting you generate images, although it doesn't seem to be working for me. And you can see, sure enough, we can basically just come in and use this, and we get a standard answer out, like we would from any other modern model. We get the little mini prompts showing where it's doing the thinking, but it doesn't seem that we get access to any actual thinking or even thinking
summaries. So this is one of the things that's different from ChatGPT, Gemini, and others: in those, at least you're getting summaries of the thinking part, not just the final answer. Here we don't seem to get access to the raw thinking, but we also don't seem to get summaries of that raw thinking. We do get an interesting answer out here, though, where it basically gives us some notes on its thinking: understand, plan, synthesize, iterate. It then generates a very nice little diagram or artifact. And just like with Claude, where you can break these things out, they have things here that obviously didn't translate well to being broken out vertically, I'm guessing. But looking at it, we've got something quite nice. So, here we can ask it for the actual tools it has access to. This seems quite good, actually, where it will
actually give us a list of what it's got. We can see in here not only the sandbox that it's running a very old version of Python in, but also that it's got access to OpenCV, scikit-learn, PDF reading tools, etc. We can see it's got stuff for visual grounding. And we can see it's got the ability to spawn an agent. Okay, so if I ask it to spin up some agents, it does seem to be getting that, right? Then it goes through searching; this is a pen that's quite difficult to get at the moment, and it goes through and actually looks for it in different places. It seems to have my location, so it knows I'm currently in Singapore, and it's looking at Amazon Singapore and the like. Unfortunately, I can't find anywhere the actual list of how many agents there are, what those agents are doing, what the key parts of them are, that kind of thing. Okay, so this is what I was getting at earlier:
the model is very solid, and probably for what Meta wants, this is a good model, if they can serve it easily and cheaply at scale. It's certainly going to be able to handle the majority of tasks that Facebook users and the like are going to end up using it for. Whether it's the model we'd want for coding agents or for API use, I've yet to see. But if we do get an open weights version of this, it could actually end up being a really interesting model. You can see here it's definitely got the ability to make sites and to try out a bunch of different ideas. It's certainly plugging the features that they want for something like this, and it's going to be interesting to see where this goes. So, overall, I do think Muse Spark is an interesting release. I think people are probably getting down on it because obviously it's not open (totally justified), and because it's
perhaps not got the best benchmarks. But I think for what Meta wants, this is probably a good model. So, go and check it out yourself; it's very easy to just come in to meta.ai and try it out. Obviously, Mark Zuckerberg is talking this up, and it's quite interesting to see that Yann LeCun is actually congratulating him on this. So, let me know what you think in the comments. Do you think we're going to get an open version of this model soon? Do you think we're going to see an 8B or a 30B version of this? Is it something you'd be interested in getting, or do you feel the whole world has just moved on from the Llama era now, to the Qwen models and the Gemma 4 models? We know there are more releases coming from the Gemma team, and we know Qwen 3.7 is just around the corner. It's going to be interesting to see what happens with these teams that are not the top three proprietary model teams. So, anyway, as always, let me know what your thoughts are in the comments, and I'll talk to you in the next video. Bye for now.