The video reviews OpenAI’s newly released GPT 5.4, highlighting access tiers (GPT 5.4 Thinking in ChatGPT Plus/Teams/Pro/Enterprise and GPT 5.4 Pro in the $200/month tier) and API availability. It covers major upgrades in reasoning, coding, agentic workflows, and state-of-the-art computer use, including an upfront plan you can steer mid-response and up to one million tokens of context. The script summarizes benchmark improvements over GPT 5.3 Codex, noting OSWorld Verified at 75% vs. humans at 72.4%.
All right, OpenAI has just released GPT 5.4. So, in this video, I'm going to go over the blog post, touch on some of the highlights as well as the key capabilities, cover things like how much it costs and where you can get started with it, and then show you some examples of what the model can actually do, especially in a coding context. First up, what's out today? There are two new models. In terms of where you can access these: GPT 5.4 Thinking is available in ChatGPT Plus, Teams, Pro, and Enterprise, whereas GPT 5.4 Pro is only available within their $200-a-month tier. You will be able to access both of these from the API. The model has big advancements in reasoning, coding, and agentic workflows, but some of the most interesting things are in and around computer use. One interesting aspect they mention is that GPT 5.4 can provide an upfront plan of its thinking, and you can adjust midcourse while it's working to arrive at a final output that's more closely aligned to what you need, without additional turns. It reminds me a little bit of interleaved thinking, which originally came out from Anthropic, where the model could invoke tools between its thinking steps; here, almost as if you were a tool call midcourse, you can steer the model. So that's a really interesting UX aspect of the model. Now, one really big one that I think a lot of people will be happy with: this model supports up to a million tokens of context, and it also has state-of-the-art computer use. And finally, there's a ton of other aspects within this release. They have improved tool search capability, similar to Anthropic having the ability to search through different tools via how they've configured MCP support within things like Claude Code. Additionally, they mention that GPT 5.4 is the most token-efficient model yet, which is often overlooked when you look at all these different benchmarks: it's one thing to perform a task really well if you brute-force it and throw a lot of compute at it, but it's a whole other thing to have a model that's token-efficient, because at the end of the day, even if a model is more expensive per token, if it can perform the task with fewer tokens, it can potentially be a lot cheaper. Now, in terms of some of the benchmarks, it's an incremental improvement over GPT 5.3
Codex in a number of different areas, but one that I find particularly interesting is OSWorld, because on this benchmark in particular the model has now surpassed humans. Humans, I believe, are at 72.4%, whereas GPT 5.4 on OSWorld Verified is at 75%. Additionally, this model is quite a bit better across a number of different tasks like BrowseComp, WebArena, and what we see within here. And in addition to that, we can also see some pretty big improvements across the board. Now, the other thing that I do want to touch on is knowledge work. Building on GPT 5.2's general reasoning capabilities, they mention that GPT 5.4 delivers more consistent and polished results on real-world work. That's another thing: it's one thing to top benchmarks, but it's a whole other thing to benchmark against real-world knowledge work, and it's great to see labs focus on real-world tasks. It's one thing to be good at a number of theoretical benchmarks; it's a whole other to actually be good at a job and cover the core capabilities of the different facets of that job. Within the blog post, there are a number of different examples, such as spreadsheets, documents, and presentations. Here are some documents that were generated by GPT 5.4 as well as GPT 5.2. You can see within each of these, on the left-hand side with GPT 5.4 for instance, subtle improvements across a number of different areas, like being able to highlight different things or create different tables, and it overall looks like a better representation, almost like something a human would make. And within the presentations, for instance, if you just look at some of the subtleties on the left-hand side with GPT 5.4, it does look like, at least in this example, there are some nice aspects to the defaults of what the model generates. Whereas on the right-hand side, especially if you've worked with one of these models on something like building a web app or a presentation, you can see it's overflowing. It's using sort of random colors. It's not really biasing towards an overall design system; it's just generating things as it goes. Whereas on the left-hand side, it looks much more professional and cohesive. And just to visualize OSWorld: it's a pretty big
leap, as you can see here, over GPT 5.2. But the other thing that I really want to focus on within this is some of the practical tasks. One thing that I've been particularly focused on lately is actually having a browser agent do a lot of the different tasks that I have at hand. For instance, I've had tasks like sorting through my email, finding different subscriptions that I could cancel and actually going through and cancelling them, or going through all my past YouTube videos and updating the descriptions. There are increasingly more tasks that you'd previously have to do yourself within a browser that I'm now beginning to be able to delegate to agents. And that's the thing that I'm really excited about with the latest models that have come out, from Opus 4.6 to models like GPT 5.3 Codex and now GPT 5.4: increasingly, these models are able to perform very real functions in the world. They're able to use browsers. They're able to use computers. They're getting to the point where they're very effective, and you can spawn them off to do a ton of useful work and automate a ton of different things. Next up, in terms of coding, GPT 5.4 combines the coding strengths of GPT 5.3 Codex with leading knowledge work and computer use capabilities. That's one thing with this model: you really do get a workhorse that's good across a ton of different tasks. If you want to write code, it can do that. If you want to leverage a browser or computer use capabilities, it can do that. And when we compare this to GPT 5.3 Codex across a number of benchmarks, it's a pretty considerable leap. For instance, on low reasoning, where you don't have the latency of waiting for the model to think, it is quite a bit of an improvement. And where it really starts to shine over GPT 5.3 Codex is at the medium reasoning level. Now, just a few examples. The model does appear to be much better at front-end tasks. Here is an example where it built a theme park simulation game. You can imagine just how complicated the logic involved in creating something like this would be. This is an example of something that it built. Additionally, it built an RPG game, similar to one of those old Nintendo-style games, with all of these different component pieces built into it: interactive cards and different moves that you can take within the game. You are going to be able to check all of these out; I'll put them in the description of the video if you want to play around with some of the games it generated. You can see here is one where it's flying around the Golden Gate Bridge in this Three.js-style simulator, with cars that are interactive; you're able to interact with the environment, and you have this very rich representation of what it generated within a web app. So overall, there's a ton of different capabilities. If you're creative and you have these types of use cases, if you want to make games, you can do all of that. Additionally, there's a great video within here from one of the developers who focused on the front-end and QA work within the application. Another really cool thing he showcased in this video, which I didn't realize you could do within Codex, is that if you take a wireframe of a front-end app you want to build, you can actually convert it, leveraging image generation right within the Codex desktop application. And now, additionally, a ton of people have had access to the model. Lee Robinson from Cursor mentioned that GPT 5.4 is currently
the leader on our internal benchmarks, that their engineers find it to be more natural and assertive than previous models, and that it works through ambiguous problems without second-guessing itself and is proactive in parallelizing work to keep things moving. So basically, across the board, the reactions that I've seen have been overwhelmingly positive with this model. I'm going to be diving into some more complex use cases in subsequent videos, actually testing this out and building some things; if you're interested in that, just stay tuned to the channel. Next up: availability and pricing. As I mentioned, you can access this within Codex, within ChatGPT, and from the API. If you're going to be leveraging it from the API, you can use these model strings: GPT 5.4 or GPT 5.4 Pro. Additionally, in terms of pricing, one thing to consider if you are going to be leveraging that million-token context window on input is that requests exceeding 272,000 tokens of context are charged at 2x the normal rate for the tokens beyond that threshold. And on top of GPT 5.4's $15 per million output tokens, the Pro model is quite expensive: it is $30 per million input tokens and $180 per million output tokens. So much, much more expensive if you do want the state of the art. But for those working on particularly hard problems who aren't cost-sensitive, just know that that option exists. And just to put this into perspective: Claude Opus 4.6, the flagship model from Anthropic, is $5 per million input tokens and $25 per million output tokens, and Sonnet 4.6 is $3 per million input tokens and $15 per million output tokens. Then additionally, for both of these models, the max output, if you really want to stress test the model's output capability, is 128,000 tokens. Now, just to put it into perspective, we can compare it to some of the other models that are out there. This is probably something that a lot of people are thinking about: how does this stack up against Anthropic's Opus model, for instance? Across a number of different benchmarks that they decided to highlight, we can see that this model outperforms Opus 4.6 basically across the board on a number of these different tasks. Additionally, it's improved
across math as well as the sciences. So the model overall stacks up quite well across a number of these core tasks when compared to some of the other models that are out there. Now, additionally, there is a new fast mode within Codex: GPT 5.4 will be able to run 1.5x faster with the same intelligence and reasoning. And that's one thing to note with Codex: they recently partnered with Cerebras, and for a lot of their latest model releases they have an option that is also focused on speed, which is definitely a really nice thing to see. But otherwise, that's pretty much it for this video. I'm going to do a more in-depth video actually showing and demonstrating this within Codex. Kudos to the team at OpenAI for another great release. Otherwise, if you found this video useful, please like, comment, share, and subscribe. Until the next one.
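As a quick aside on the pricing discussed above, here's a minimal sketch of how that long-context surcharge would work out in practice. The 272,000-token threshold and the 2x multiplier come from the video; the per-million rates used below and the exact billing formula are assumptions for illustration, not confirmed official pricing.

```python
# Sketch of the long-context surcharge described in the video.
# Assumed rule: input tokens beyond 272,000 are billed at 2x the
# normal input rate; output tokens are billed at a flat rate.
# The rates passed in below are placeholders from the video,
# not confirmed official pricing.

LONG_CONTEXT_THRESHOLD = 272_000

def request_cost(input_tokens: int, output_tokens: int,
                 input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Estimate the cost of one request in dollars."""
    base_in = min(input_tokens, LONG_CONTEXT_THRESHOLD)
    surcharged_in = max(0, input_tokens - LONG_CONTEXT_THRESHOLD)
    cost = (base_in * input_rate_per_m
            + surcharged_in * 2 * input_rate_per_m) / 1_000_000
    cost += output_tokens * output_rate_per_m / 1_000_000
    return cost

# Example: a 300k-token input at a hypothetical $30/M input rate means
# the 28k tokens past the threshold are billed at $60/M.
print(round(request_cost(300_000, 10_000, 30.0, 180.0), 2))  # prints 11.64
```

The takeaway is that a full million-token request costs noticeably more than a pro-rated extrapolation from short requests would suggest, since nearly three quarters of the input lands in the surcharged band.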