TheAIGRID
Meta Just Changed Everything. Muse Spark Destroys GPT-5.4 & Gemini on Key Benchmarks.
Channel: TheAIGRID
Date: 2026-04-09
Duration: 14min
Views: 7,839
URL: https://www.youtube.com/watch?v=mOTzmb1m0Uc

🌐Subscribe To My Newsletter - https://aigrid.beehiiv.com/subscribe

Get your Free AGI Preparedness Guide - https://theaigrid.kit.com/agi

🎓 Learn AI In 10 Minutes A Day - https://www.skool.com/theaigridacademy

🐤 Follow Me on Twitter https://twitter.com/TheAiGrid

Links From Today's Video:

https://x.com/alexandr_wang/status/2041909376508985381

Welcome to my channel where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all.

So Meta finally released their new AI model, Muse Spark, so let's talk about it. Muse Spark is essentially the model we've all been waiting for from Meta's Intelligence Labs. It's the first in the Muse family of models developed by Meta's new team, and one of the most interesting things about it is that it is natively multimodal. First, let's get into everything that makes this model great, and then we can talk about some of its drawbacks, to give you guys a balanced perspective. One of the first things Meta talks about is that this is one of the first models they're going to be releasing, and it's out today; you can try it now. But one of the key things to understand is that this model is natively multimodal, meaning it was built from the ground up to understand video, images, audio, and text, which means that on the benchmarks you can see here, the multimodal areas do perform significantly better than many competitors. Now, it doesn't completely stand out across every single area.

There are some areas where other models like GPT-4 or Gemini 3 perform slightly better, but this is arguably the strongest area for the new Muse Spark model. Now, we'll come back to the benchmarks later, but let's take a look at the overall picture first. This is the Artificial Analysis index, where they combine several different benchmarks (GPQA, reasoning, and many others) into one score. You can see that Meta's Muse Spark model is currently sitting behind Claude Opus 4.6 Max on this index. The reason I think this index is a pretty good baseline is that it shows us the combination, or average, of many different results, not just one specific area, so across the board you can see exactly where the model sits. A lot of people might think this model isn't as good, but we can clearly see they've taken a huge jump since Llama 4 Maverick. Now, this is clearly a

frontier-class model, and I'm going to show you some more key areas where it excels. If you want to know exactly where it excels, you can look at this multimodal example. There was a website that went and did some extra testing on the Meta model, and this model currently excels in the visual area. Remember how I said multimodality is where the model excels? This was one of the examples they showed: they had the model look at a chalkboard menu from Yezis. This is of course a pretty difficult task because it's handwritten in chalk, there are some glass reflections, and there are multiple sections with multiple prices. Then, of course, you can ask the model what's on the menu. Here you can see from the responses that Meta's Muse Spark was able to get this correct the majority of the time compared to other models. Remember, guys, some people could argue this is cherry-picking, considering Meta have of course retweeted it. But I would

argue that most models aren't actually natively built to be multimodal; most of them are simply text-based. That's why, when you have companies like Google and Meta training their models natively to be multimodal, what tends to happen is you get some very effective models with multimodal reasoning capability. Another thing that was very interesting is what you could call the real-time data section. This one was super interesting because it's an area where you would think Grok currently leads. Most people, including myself, use Grok for real-time data because it is just very up to date when it comes to that stuff. But here's what we actually got: they asked each model to find current stock prices for Nvidia, AMD, and Intel, and Meta's Muse Spark was the best model at doing that; it was able to pull all of the most up-to-date information. This was super interesting because if you look at the deep search QA benchmark, this is once again where

Meta's Muse Spark actually scored relatively well. The reason I've included this is that it's probably something most people might miss. I know some people are like, "What are the inherent benchmarks?" But some people won't realize that this is very key to how the model performs. Now, let's take a look at something Meta has natively that other models don't, and I do wonder if this agentic feature will become a default in other models considering how effective it is. Take a look at this; it's the first time I've seen it in an actual LLM. Some people will call this a model council, some call it an LLM judge or a voting arena, but Meta have released something in this model called contemplating mode. It orchestrates multiple agents that reason in parallel, and it's designed to handle complex scientific reasoning queries. In their testing, they found it competitive with other extreme reasoning models such as Gemini Deep Think and GPT Pro. So what this is, is you have essentially an AI that spins

up other agents to collaborate and combine their reasoning efforts into one final judgment, and in doing that, it can actually get results that are not only better than these current models, but also more token-efficient. You can see here that on Humanity's Last Exam, this currently looks close to state-of-the-art, just three points behind GPT 5.4 Pro, and it's actually better than other models when tools aren't used. You can see that in Frontier Science Research it scores 38.3, which is currently a state-of-the-art result. So I do think multiple agents collaborating is probably going to be a theme for the future, considering that most of these models' capabilities are well within reach of one another; there doesn't seem to be some massive, crazy area where one has a huge gap over the rest. And if you want to take a look at contemplating mode, you can see here how that level of ability scales with the number of agents: you've got one agent, then two agents, then four agents, and then when you have 16 agents, you can see that accuracy continues to increase.
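Meta haven't published how contemplating mode is actually wired up, but the "spin up agents, then combine their reasoning into one judgment" description maps onto a fairly standard fan-out-and-judge pattern. Here's a minimal sketch of that pattern; call_model is a stand-in for whatever text-in, text-out call you have, and none of this is Meta's actual API:

```python
# Hypothetical sketch of a "contemplating mode" style orchestrator.
# call_model(prompt) -> str is an assumed stand-in, not Meta's API.
from concurrent.futures import ThreadPoolExecutor

def contemplate(call_model, question, n_agents=4):
    """Fan out n_agents independent reasoning passes, then aggregate them."""
    prompt = f"Think step by step, then give a final answer.\n\nQuestion: {question}"

    # 1. Parallel reasoning: each agent works the problem independently.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        drafts = list(pool.map(lambda _: call_model(prompt), range(n_agents)))

    # 2. Aggregation: one judge call reads every draft and issues a single verdict.
    #    (A cheaper variant is a majority vote over just the final answers.)
    judge_prompt = (
        "You are the final judge. Here are several independent solutions:\n\n"
        + "\n\n---\n\n".join(drafts)
        + "\n\nCombine their reasoning and give the single best final answer."
    )
    return call_model(judge_prompt)
```

Whether Meta's version uses a learned judge, voting, or something else isn't stated; the chart in the video just shows accuracy climbing as you go from 1 to 2 to 4 to 16 agents.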

They don't really talk about this that much, but I do wonder, and this is just pure speculation, whether there is some inherent scaling law with these agents: maybe the more agents you add, the more you're able to do. It will be super interesting to see how this evolves with different architectures that unhobble the gains locked within LLMs. Something else that was really cool was this prompt. Someone took a photo of their fridge, put it into the multimodal AI, and said: "Hey, I'm someone with high cholesterol. Put green dots on recommended foods and red dots on non-recommended foods. Don't duplicate dots, make sure they're localized properly, and when hovering over the dots, show the justification and a health score along with calories, carbs, protein, and fat." And this was something the model was able to do relatively well.
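To make that concrete, the kind of structured output you'd want back from a prompt like that might look something like the snippet below. This is purely my own illustration of what "localized, non-duplicated dots with hover details" means as data; the field names and the yogurt example are not from the demo:

```python
# Hypothetical per-item annotation for the fridge demo (illustrative only).
annotation = {
    "label": "nonfat Greek yogurt",   # food item detected in the photo
    "dot": "green",                   # green = recommended, red = not recommended
    "x": 0.42, "y": 0.31,             # normalized position, so the dot stays localized
    "justification": "High protein, very low saturated fat.",
    "health_score": 8,                # e.g. 1-10 for this user's profile
    "calories": 100, "carbs_g": 6, "protein_g": 17, "fat_g": 0.7,
}
# One entry per item, no duplicates; the front end draws the dots and shows
# the justification and macros on hover.
```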

The thing is, I actually tested this demo and it works exactly as shown. You'd be surprised how many times AI companies release stuff and the demo just doesn't work; you're not able to replicate it. But this was something I was able to replicate pretty effectively, and I was very pleasantly surprised by how good it was. So this is something I think is very effective, and what it showcases is not just some raw coding ability, not that it's state-of-the-art there, but it does show that the model has genuinely good multimodal features. And again, with multimodality coming into Muse Spark, what we also get is the ability to analyze video. Most people don't realize this, but most current LLMs cannot natively analyze video. The only current model that is really able to do that is Gemini. I think Grok can in some instances, and sometimes ChatGPT kind of hallucinates it, but Gemini is the main one, and of course now Meta's multimodal model can do that as well. So this is something you can use to analyze video. I do know that, yes, there are some open-source ones, but this one is just a little bit better. Now, what was really

interesting is that Meta actually shared the scaling curves from their reinforcement learning training, and this is where it starts to get interesting. On the left you've got accuracy, and on the right you've got accuracy on a held-out evaluation set, meaning problems the model has never seen before. And if you look at those lines, you can see they're still going up. There's no plateau, no flattening out. That means Meta is essentially telling us they can keep pushing performance just by training longer. And the real breakthrough here is something they call thought compression. You know how thinking models like o1 and o3 burn through so many tokens trying to reason through a problem? Meta found that if you penalize the model for thinking too long, something interesting happens: the model actually learns to compress its reasoning and solves the same problem using fewer tokens. So take a look at this: it's thinking, it's thinking, then it gets penalized, it compresses, and then even when it's allowed to think longer again, its reasoning stays way shorter.
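Meta haven't published the exact objective, but "penalize the model for thinking too long" is the kind of thing you can express as a length term in the RL reward. A minimal sketch of that idea, with a made-up budget and coefficient just to show the shape:

```python
# Hypothetical length-penalized RL reward ("thought compression" style).
# The budget and coefficient are assumptions: correct answers still dominate,
# but every reasoning token past the budget costs a little reward, so the
# policy is pushed toward shorter chains of thought that still land on the
# right answer.
def reward(is_correct, n_reasoning_tokens, token_budget=2000, penalty_per_token=1e-4):
    base = 1.0 if is_correct else 0.0
    overrun = max(0, n_reasoning_tokens - token_budget)
    return base - penalty_per_token * overrun

reward(True, 6000)   # 1.0 - 1e-4 * 4000 = 0.6: correct but verbose is paid less
reward(True, 1500)   # 1.0: correct and concise keeps the full reward
```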

Imagine you're writing a 2,000-word essay to explain something, and then someone forces you to say the same thing in 500 words. You get sharper, because if you're then allowed 2,000 words again, you're way more efficient with each word and can cover more ground. That's what Meta figured out how to make the model do automatically through reinforcement learning. And you have to think about this at scale: every token costs these AI companies money, and at Meta's scale that's billions of users, which means an insane compute bill. So this trick means the model gets smarter while using fewer tokens to think, which means it's cheaper to run, responses are faster, and you get basically the same or better answers. That's very interesting, and I haven't really seen it before, so I think it shows that Meta isn't just copying what other models have done; they're innovating in areas that haven't really been innovated on. And when we look at just how efficient Meta are when it comes to training, this chart is Meta's scaling ladder. This is where they trained a family of smaller models to map out how performance improves the more compute you throw at it. Lower on the y-axis is better (the model is better at predicting code), and more compute on the x-axis means the model is

more expensive to train. Now, the key bit here is the colors. The colored dots off to the right of the curves are competitors: Llama 4 Maverick, which is Meta's previous model, then DeepSeek, then Kimi K2. The horizontal lines with the multipliers show how much more compute those models needed to reach the same level of performance as Muse Spark. Llama 4 Maverick needed 10 times more compute to match the same quality, DeepSeek needed eight times more, and Kimi needed three times more. So what does this mean? Well, it means that Meta rebuilt their entire training recipe over nine months: the architecture, the optimization, the data curation. The result is that Muse Spark extracts way more capability per unit of compute; they can train a model to the same level of quality using a fraction of the resources. And when you think about it, that's a cost and speed advantage. If you need 10 times less compute to hit the same performance, you can either train the same model for way cheaper or spend the same budget and train a much better model. And at Meta's scale, this translates into billions of dollars in

savings and the ability to iterate way faster than their competitors. One of the other things they decided to focus on for this model was healthcare. They collaborated with over a thousand physicians to curate training data that enables more factual and comprehensive responses, so that Muse Spark can generate interactive displays that unpack and explain health information, such as the nutritional content of various foods or the muscles activated during exercise. Of course, you will have to play with this yourself, but remember it's live now, and I'll probably have a tutorial dropping later today. Now, here's one thing I did want to cover, and that is the overall benchmark. I think this may be a little bit sneaky from Meta, because when they dropped this, they dropped an entire benchmark page with Muse Spark on the left-hand side, the way you'd usually present an LLM. Usually, when you drop an LLM, you'd expect the model to be state-of-the-art in at least one category or several areas. But here, the thing Meta have done that is pretty confusing is that they've made all

of Meta's results blue. What they've done here is kind of subconsciously trick you into thinking the model is state-of-the-art across the board. But on further inspection, if you actually look at it, there are several areas where Meta isn't state-of-the-art. If we take a look at the current benchmarks, and this is a screenshot I found on Twitter that shows where Meta is actually state-of-the-art and where other models are, you do see certain areas where the model currently sits at the frontier. The reason I think this is important is that if we're going to look at this, it's important to be objective. That first screenshot can be a little bit subconsciously biased, but what we can see here is that Muse Spark does do well in, like I said, agentic search, open-ended health, and of course multimodal reasoning, which is very good. When we look at the benchmarks, it does say that Gemini 3.1 Pro is currently excelling across the board, but I would say we're at the point where all of these models are within maybe 2 to 5

percentage points of each other. So I would say it doesn't really matter that much how big the difference is anymore, as long as the models excel in their specific domain. As you guys know, Anthropic's domain is currently code, Gemini's is multimodality, and Meta's could also be multimodality, but it will be interesting to see where they try to lean. Now, if you're wondering whether this model generates images: if you're actually using the app, and this is something I was wondering, it doesn't actually have image generation natively embedded in the model; it just uses Midjourney under the hood. So if you're using it thinking it's probably got a new image and video model, that is not the case. It is just Midjourney under the hood. And Midjourney generates aesthetic images, but do remember that aesthetic images aren't exactly the most accurate when it comes to visual representation. So bear that in mind when you try to use the model to generate an image.