Gemma 4 is honestly one of the craziest open model drops we've seen. In this video, I put Google's latest models through real tests: not just benchmarks, but actual workflows. We're talking frontend generation, agentic tool use, multimodal reasoning, and even running these models locally at speeds that shouldn't be possible.
🔗 My Links:
Sponsor a Video or Do a Demo of Your Product, Contact me: [email protected]
🔥 Become a Patron (Private Discord): https://patreon.com/WorldofAi
🧠 Follow m
Google is finally back with their new open-source model, the Gemma 4 series, a brand-new family of AI models designed for advanced reasoning and agentic workflows. These models are released under the permissive Apache 2.0 license, and the core focus here is intelligence per parameter. In simple terms, smaller models are performing much like larger ones, with some even outperforming models up to 20 times their size. With this release, we get four new models: the 2-billion-parameter model, an ultra-efficient option built for mobile and edge devices; the 4-billion-parameter model, a stronger edge performer with multimodal capabilities; the 26-billion-parameter model, which is highly efficient, activating only around 3.8 billion parameters during inference; and the 31-billion-parameter dense model, the highest-quality model with near top-tier open-model performance. Now, across the board, these models support multi-step reasoning, strong math and planning capabilities, and are
built for agentic workflows. That means solid tool use, structured JSON outputs, and strong coding capabilities. They also support over 140 languages and offer up to a 256K context window. But what's really insane is the real-world performance. The 26-billion-parameter model, running on something like the Mac Studio M2 Ultra, which is already a few years old, can still push around 300 tokens per second, which is just incredible. That kind of quality across so many different devices is remarkable. Now, the performance that you can get from these models, especially in real time, is going to be massive for local developers and users. If you want the best AI tools, workflows, and drops before everyone else, join my free newsletter with the link in the description below. Now, to give you some reference points, take a look at the industry-leading efficiency chart. You'll notice that the flagship Gemma model, the 31-billion-parameter one, scores 31 on the
intelligence index, trailing the Qwen 3.5 27B, which sits at 42. So on paper, yes, the Qwen 3.5 27B is more capable. But here's the trade-off that actually matters: Gemma 4 uses roughly 2.5 times fewer output tokens for similar tasks. That means significantly better efficiency, lower cost, and faster generations in real-world use. So the question becomes: is that gain in intelligence worth burning two to three times more tokens? Benchmark-wise, the 31-billion-parameter model delivers strong results in almost every category. On MMLU-Pro, it scores 85.2. It excels on the math benchmarks and on GPQA. And on LiveCodeBench, it even scored around 80% despite its size. It's currently ranked number three among open models on the LM Arena leaderboard and shows strong multimodal performance as well.
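To make that trade-off concrete, here's a quick back-of-the-envelope sketch; this is my own illustration, not anything from Google's materials. It plugs the cloud pricing quoted below, roughly 14 cents per million input tokens and 40 cents per million output tokens, into a per-task cost, using made-up round token counts for the two models.

```python
# Back-of-the-envelope cost comparison based on the quoted pricing
# ($0.14 per 1M input tokens, $0.40 per 1M output tokens) and the
# "~2.5x fewer output tokens" efficiency claim. The per-task token
# counts below are illustrative round numbers, not measurements.

IN_PRICE, OUT_PRICE = 0.14, 0.40  # USD per 1M tokens

def cost(input_tokens: int, output_tokens: int,
         in_price: float = IN_PRICE, out_price: float = OUT_PRICE) -> float:
    """Cost in USD for one request at per-million-token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Assume a task with 2,000 input tokens; an efficient model answers in
# 1,000 output tokens while a chattier one needs 2.5x as many.
efficient = cost(2_000, 1_000)
verbose   = cost(2_000, 2_500)

print(f"efficient: ${efficient * 1000:.2f} per 1,000 tasks")  # $0.68
print(f"verbose:   ${verbose * 1000:.2f} per 1,000 tasks")    # $1.28
```

At these assumed numbers the verbose model costs nearly twice as much per task, which is why token efficiency can matter more than a few benchmark points in high-volume use.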
This leap over previous versions is pretty massive, and the series is now competitive with, or even beats, much larger models in reasoning and coding. Pricing is pretty solid as well if you're using it in the cloud: the 31-billion-parameter model comes in at approximately 14 cents per 1 million input tokens and 40 cents per 1 million output tokens. Now, to get started with this model, you can test it out through Google AI Studio completely for free. You can also access the API via the official Google API docs, through OpenRouter, or through Kilo, which I would highly recommend pairing with this model, because the Kilo CLI, an open-source harness, brings out the agentic capabilities of the model quite well. They also offer $25 worth of free credits, which is worthwhile, especially if you're using it via the API. And remember, the weights are open, meaning you can install this on whatever operating system you have, as long as you meet the requirements. So you can
easily install it using different methods like Ollama, Hugging Face, or even something like LM Studio. To start off, we're going to test the flagship model, the 31-billion-parameter Gemma 4, on a front-end task: asking it to create a macOS-styled operating system, using Kilo as the harness, which I personally believe is the best harness for fully bringing out the model's agentic capabilities and tool use. And the fact that I'm doing this completely for free, thanks to the credits they provide, is incredible. There we go. Thanks to Kilo's harness, we're able to get this generation. You have the main loading screen, which is something the Qwen 3.6 didn't do. And overall, I like the background it added. You have the toolbar, which is perfectly generated. The SVGs are a little lacking, but it still did a great job with the toolbar; it looks pretty great. You don't have the functional components the Qwen 3.6 had, but regardless, the fact that this model pulled off this type of generation at 31 billion parameters is
insane. None of the folders are fully generated, which is a downside, but you still have functional apps. You can see that these components don't fully work, but they still look really great, and they clone exactly what macOS does. I'm not able to click into the pictures, which is unfortunate, but it still coded out all of these components, which is nice to see. You have a calculator, a terminal, and a settings app. In the settings, I doubt anything actually works; for the appearance, for example, I'm not able to change it. So that's the only downside. But overall, I give this a 7.5, and since this is a tiny model that can generate this type of quality, I can push it up to an eight. Next, using the Kilo CLI, I tested another front-end task, but this is a more complex generation for the Gemma 4 models, with strict design rules, coding, and interaction constraints. This is what the Gemma 4 31B generated, and I'm quite surprised at how well it generated this
despite its size. The fact that it did quite well with its agentic execution, instruction following, and ability to produce high-quality, production-level UI code is quite incredible, and this is comparable to what we saw previously from the Qwen 3.6 as well as the Opus 4.5. Now take a look at what the 26-billion-parameter Gemma 4 model generated, which is still pretty decent. And the fact that I can run this model on my computer is insane. Obviously, there are a couple of wonky parts over here with the sliding animation and the dynamic movement, but the fact that it's able to handle multiple typographies, dynamic movements, and structures is incredible. Here I am trying out another generation with the 26-billion-parameter Gemma 4 model, and it also did a really great job. Obviously, there are a lot of wonky parts with the dynamic movements, but the fact that it got the basic structure down is the most important thing. You can iterate on the other components afterwards. You're not
looking for one-shot generations with these models, but the quality is definitely there with these smaller models. Next, I asked it to create an F1 donut simulator. This is a test that focuses on complex visual simulation, physics-like motion, which it didn't perfectly nail, and 3D rendering in raw browser code. So this is definitely a great generation despite the model's size. Overall, it did a great job with its creativity and technical depth, but it is not comparable to something like the Qwen 3.6, which did a remarkable job yesterday. I highly recommend you take a look at that video, because that model is truly underrated despite its pricing; it does quite well with almost every component, from spatial reasoning to code generation, especially front end. Now, I'm testing it out within the arena. This is where I compare the Gemma 4 31B to the Qwen 3.6 Plus. And I forgot to mention this, but you can use these models completely for free and test them
out within the battle mode. Now, these two generations are pretty remarkable, because both models are able to build an interactive UI system with state management, where I'm telling them to create a product viewer with 360-degree rotation, zoom, and hotspot annotations. And you can see that in this case it even added a shadow, which is something I haven't seen models generate. The UI is pretty basic, nothing too extraordinary, but the fact that I'm able to change the different colors of this product is great. You also have the ability to click on the features, which is also nice. Now, in terms of SVG, it is decent but not the best, I would say. In this case, the butterfly looks really good, where I told it to create the butterfly and it even animated it. But with the other generations, like the PS5 controller SVG, it did a great job with the overall structure for a 31-billion-parameter model, but it's still not accurate and doesn't depict what a real PS5 controller looks like. But in terms
of generating something like an SVG painting, this is where it did exceptionally well. It's able to showcase the ambience, and also the wind flowing through the trees, which doesn't accurately depict how trees move, but it still does a decent job with the overall generation here. Front-end quality-wise, it is pretty decent, and the fact that it's able to generate clones like this of something like Airbnb is pretty incredible. It looks exactly like the Airbnb website, and it did a great job depicting even the SVG generation of all the icons as well as the formatting. Unfortunately, no Minecraft clone at this model size; it's not able to do that just yet. But it did a great job handling the game logic of building a carrom board. There's real physics simulation, there are real-time interactions with the system, you have game rules, and all the components were generated perfectly. The state management, the rule implementation like turns and scoring, as well as the smooth motion mechanics are
great in this case. So I really like what it generated with this game. Now, along with this Gemma 4 release, Google also dropped something really interesting called agent skills through the Gemini app. On a mobile device, for example, it lets anyone input different skills and have the smaller Gemma 4 model reason through them and use them for different use cases. What makes it crazy is that everything runs entirely on the device. There's no cloud, no external compute, just your phone handling this model, which is just mind-boggling to me. In this demo, you can see that it's able to use different tools, chain them together, and execute multi-step tasks. So instead of just answering a question, it actually decides what tools to use, in what order, and how to combine the outputs. For example, you can query it to pull in structured data from your phone, have it process it, and even generate something like a visualization, all in one flow. This is
basically a full agent system running locally: it does function calling across multiple tools and helps you use AI at a higher standard on your own device. And surprisingly, since this model is multimodal, its visual reasoning capabilities, comparing multiple images, for example, or extracting shared patterns and understanding visual context, are something I didn't really expect at this parameter size. It can analyze, parse, and synthesize insights across images, not just describe them. It can even enable deeper visual reasoning tasks from your phone. If you like this video and would love to support the channel, you can consider donating through the Super Thanks option below. Or you can consider joining our private Discord, where you can access multiple subscriptions to different AI tools for free on a monthly basis, plus daily AI news, exclusive content, and a lot more. Overall, Gemma 4 is a model series that proves open models are now hitting a point
where efficiency, agentic workflows, and local performance actually matter more than raw size, and it delivers that level of capability running on phones as well as my own computer. The future of AI is clearly shifting toward faster, cheaper, local systems. I'll leave all the links I used in today's video in the description below. I hope you enjoyed today's video and got some value out of it. Make sure you take a look at our second channel, where we're constantly posting AI news. Join the newsletter, join the Discord, follow me on Twitter, and lastly, make sure you subscribe, turn on the notification bell, like this video, and take a look at our previous videos so you can stay up to date with the latest AI news. But with that thought, guys, have an amazing day, spread positivity, and I'll see you guys fairly shortly.
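One last technical aside before you go. The agent-skills flow from the demo, where the model decides which tools to call, in what order, and then combines the outputs, boils down to a simple loop. Here's a toy sketch of that loop; the tool names, the step count, and the canned "model" are all stand-ins I made up, since the real on-device runtime and Gemma 4's actual tool-call format aren't shown in the video.

```python
# Toy sketch of an agentic tool-use loop: the "model" emits one structured
# tool call at a time, the harness executes it and feeds the result back,
# mirroring the pull-data -> process -> visualize flow from the demo.
# All names and values here are illustrative, not a real API.

TOOLS = {
    # Stand-in for reading structured data off the phone.
    "read_step_count": lambda args: {"steps": 8432},
    # Stand-in for rendering a visualization from that data.
    "make_chart": lambda args: f"chart({args['value']} steps)",
}

def fake_model(history):
    """Canned planner standing in for the model's structured (JSON) output."""
    if not history:                              # step 1: fetch the data
        return {"tool": "read_step_count", "args": {}}
    if len(history) == 1:                        # step 2: visualize it
        steps = history[0]["result"]["steps"]
        return {"tool": "make_chart", "args": {"value": steps}}
    return {"final": history[-1]["result"]}      # done: return the chart

def run_agent():
    """Drive the decide -> call -> observe loop until the model says 'final'."""
    history = []
    while True:
        action = fake_model(history)
        if "final" in action:
            return action["final"]
        result = TOOLS[action["tool"]](action["args"])
        history.append({"tool": action["tool"], "result": result})

print(run_agent())  # prints "chart(8432 steps)"
```

A real harness would replace `fake_model` with a call to the model and validate its JSON before dispatching each tool, but the control flow stays the same.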