GosuCoder
Best AI coding Agents with some interesting results | Sonnet 4.5, GLM 4.6, Grok Code, GPT 5 Codex
Channel: GosuCoder
Date: 2025-10-01
Duration: 25min
Views: 35,769
URL: https://www.youtube.com/watch?v=sslJ9ovlfhM

There was so much released in the month of September. I do my best to put it all through the testing gauntlet and share the results here. Fortunately, Sonnet 4.5 and GLM 4.6 made it just in time to be included!

Some massive surprises, and exciting times ahead.

Links:

🧑‍💻My Recommended AI Engineer course is Scrimba:

https://scrimba.com/the-ai-engineer-path-c02v?via=GosuCoder

My Links 🔗

👉🏻 Subscribe: https://www.youtube.com/@GosuCoder

👉🏻 Twitter/X: https://x.com/GosuCoder

👉🏻 LinkedIn: https://www

All right, so when you're lying in bed having trouble sleeping, I want you to lie there and try to count every AI coding agent instead of counting sheep. That should help you fall asleep, because there are so many of them now. I'm working out how, every month, I can post a poll so the community can help me figure out which ones they care about the most, because it changes from month to month. This month, for example, Droid came out, which is Factory CLI's new AI coding agent, and a lot of people were excited to see it. We also had the GitHub Copilot CLI come out, which, to be totally honest with you, is not worth testing yet, so I left it off. I evaluated it, and it's too early for me to be looking at it; in fact, at the time I tried it, I couldn't even pick the model. So, as everyone knows, what I'm really measuring here is instruction following. I care a lot about whether, when I tell an agent to go to completion and do these particular things, it actually does them.

To me, this isn't going to be the end-all, be-all test, right? You're going to have your own feelings about how an agent communicates and talks back to you, and those things matter a lot too. But these tests are primarily built around one question: if I have an incredible spec and an incredible suite of tests, can it get to 100%? And I will tell you, this particular month, there were a few times that tests which had never hit 100% actually did. So, it's kind of crazy. I do have an LLM as a judge that accounts for about 30% of the score. It's run multiple times and we take the median, and the variance of that is actually pretty low. For example, it might aggregate everything together and come in between, say, a 3.5 and a 3.8. It's very tight scoring, but it's very consistent: if there's a bad entry, it will very consistently score it lower.
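
To make that concrete, here's a minimal sketch of what a median-of-judges setup like this could look like. The judge_once() placeholder, the 1-to-5 scale, and the exact 70/30 blend are my assumptions for illustration, not the actual harness.

```python
import statistics

def judge_once(run_output: str, context: dict) -> float:
    """Score one agent run on a 1-5 scale with a judge model.
    `context` carries the extra signals mentioned above: good/bad
    reference examples plus unit-test and lint results."""
    raise NotImplementedError("call your judge model here")

def judge_score(run_output: str, context: dict, runs: int = 5) -> float:
    # Run the judge several times and take the median, which damps
    # the run-to-run variance of any single LLM call.
    return statistics.median(judge_once(run_output, context) for _ in range(runs))

def final_score(objective: float, judge: float) -> float:
    # The judge accounts for roughly 30% of the final score; the
    # rest comes from objective checks (tests passing, linting).
    # Both components are assumed normalized to 0-1 here.
    return 0.7 * objective + 0.3 * (judge / 5.0)
```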

So, it does help me assess the quality. The way it's built, the judge is given good examples and bad examples, plus information about how the run did on passing unit tests, linting, and all that. So it gets a ton of information to work with when producing its score. Here's an example. None of my tests are simple "hey, go do this one particular thing" tasks. They're all about multiple files: can you go and edit and work across an entire code base? A lot more tokens. This is one I've unfortunately had to deprecate, but I love showing it because it gives you an idea of what I'm doing. This is the actual spec I give, with my deliverables and expected files. Notice this isn't just filling out a function or two; this is building an entire application. The same thing happens with the editing tasks.
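
For a sense of what checking those deliverables might look like, here's a small hypothetical sketch; the file names are placeholders, not the real spec's.

```python
from pathlib import Path

# Hypothetical deliverables check: the spec lists expected files, and
# after the agent finishes we verify each one exists. These file
# names are placeholders, not the real spec's.
EXPECTED_FILES = [
    "src/app.py",
    "src/routes/users.py",
    "src/models/user.py",
    "tests/test_users.py",
]

def missing_deliverables(workspace: Path) -> list[str]:
    # Return every expected file the agent failed to create.
    return [f for f in EXPECTED_FILES if not (workspace / f).exists()]

missing = missing_deliverables(Path("./agent-workspace"))
if missing:
    print("Agent did not deliver:", ", ".join(missing))
```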

They're more than changing or implementing a single file; each one is about touching between eight and 20 files. That's the range I've set in my head for this, and honestly, I think I need to bump it up, because these things have just gotten so good. Some notes: I do want to talk a little bit about Zed. I'm not going to touch on it too much here, but you're going to see Zed do well, and it does way more than you ask it to. So if you do want to try Zed, I'm going to warn you: if you give it a spec, it is going to do way more than what's in that spec, and it's up to you whether you like that or not. There were two new models added this month, GLM 4.6 and Sonnet 4.5, right during testing, so that threw everything for a loop. I was going to do Qwen 3 Max, I had it on the list, and I was going to do Kimi K2, but I needed to bump and change some things around to fit GLM 4.6 and Sonnet 4.5, because I feel that's more important this month, at least to get a snapshot of where they are at this point in time.

Now, GPT-5 Codex in other agents is kind of sketchy, I'm going to be totally honest with you. I was expecting better, but I think it makes sense as I talk through some of the numbers here. The models, of course: Claude 4.5 Sonnet and GPT-5 Codex. I did spot-check high reasoning, but I mostly ran medium; high is just so slow. And for all the people saying Claude 4.5 Sonnet is fast: it has to be that those of us who feel it's fast are coming from GPT-5 Codex. I will say, though, in some of my tests the token efficiency of Claude 4.5 Sonnet is around 20% fewer tokens, with higher, or I should say higher or similar, scores compared to Claude 4. So there is some improvement there. I did a couple of GPT-5 spot checks, which we'll see in the final chart. GLM 4.6 is something I ran wherever I could, and for that one I actually have an API directly to Z.ai. To be very clear, where possible, I used my personal Z.ai account to run those.

And then, of course, Grok Code Fast, because apparently Grok Code Fast is something like 75% of total coding tokens, and I want to keep an eye on why people are using it so much. It turns out it's still mostly free in a lot of these agents, so that's why it's being used as much as it is. So, Claude 4.5. Now, this is going to be interesting. Claude Code: 26,930, a phenomenal score. It crushed it. But is there a perceivable difference between 25,000 and 27,000? Honestly, it's hard to say, because in my analysis of using Claude 4 and Claude 4.5, they have the same quirks. I'd say Claude 4.5 even has some different quirks that were kind of annoying. For example, I was working on an issue and I kept telling it, "No, I don't think that's right."

It would ping-pong back and forth between two solutions, rather than letting me have a conversation about where I wanted it to go, which was somewhere in between the two, if that makes sense. So I'd say, "No, I don't think that's right. Maybe we should consider this." "Oh, you're right. I reverted back to the other solution." That's not what I wanted. So it's a little annoying. I do think over time I'll personally get better at talking to it; I may need to change some of my prompts and just learn the model a bit. So I'm just calling out that it scored well, but if you were to tell me I was using Claude 4 or Claude 4.5, I'm actually not sure I could realistically tell the difference. There are people saying it feels way faster, and I think some of that has to do with parallel tool calls, because when I get parallel tool calls going, it does feel faster. But overall, I don't feel like it's that big a step up. Warp, as always... well, I shouldn't say as always; for the last two months, it's done incredible with Claude models. Basically identical to Claude Code here, we're talking just runtime variance.

And Zed AI: I mentioned earlier you've got to be careful with it. It did top the charts for some crazy reason this month, but holy crap, I gave it the spec and it took forever. It just runs and runs and runs. It would say, "Okay, one more thing. I'm going to create a document you didn't ask for." "Okay, one more thing. I'm going to create a document you didn't ask for." And it would just keep doing that, over and over again. So by the time it was done, it had created so much stuff I did not have in the spec or ask for. This was done in the right mode, just to be very clear. It actually got me considering whether I need to change the scoring to deduct points when an agent does more than I ask for. Does that make sense? Anyway, it's interesting, because I want to like Zed AI, but testing it, I was actually very annoyed at how long it ran. Anyway, we'll talk about that a little more later.

Overall, there's not much difference with Claude Sonnet 4.5 between the bottom and the top. I will say, with Augment CLI, I don't know why Augment is always on the lower end, but the agents at the top do a much better job of looping over their own work, over and over again, to make sure it's fully working. The ones at the bottom, Droid, Augment CLI, Windsurf, do a pass, maybe do a little bit of testing, and then stop, where the ones up at the top just kind of keep grinding. I get asked a lot what the difference is between, say, a 24,000 and a 27,000 here. Honestly, it's self-iteration. I have no doubt you could get any of the ones at the bottom as high as the ones at the top if they just iterated a little more. So, take that for however you want.
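
Here's a rough sketch of that "keep grinding" loop that seems to separate the top agents from the ones that stop after one pass. The ask_agent() call is a placeholder for whatever the agent actually does, and pytest stands in for whichever test runner the project uses.

```python
import subprocess

def ask_agent(prompt: str) -> None:
    # Placeholder: invoke whatever coding agent you're testing.
    raise NotImplementedError

def run_tests() -> subprocess.CompletedProcess:
    # Run the project's test suite and capture its output.
    return subprocess.run(["pytest", "-q"], capture_output=True, text=True)

def iterate_until_green(spec: str, max_rounds: int = 10) -> bool:
    ask_agent(spec)
    for _ in range(max_rounds):
        result = run_tests()
        if result.returncode == 0:
            return True  # everything passes: stop grinding
        # Feed the failures back so the agent can take another pass.
        ask_agent(f"Tests are failing, fix them:\n{result.stdout}")
    return False
```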

Now, GPT-5 Codex. Oh, let me touch on one other thing real quick. Some people mentioned I should have had thinking on. To be very clear, all of these tests, except for two that I'll point to in a second, were done without thinking on, or with the default setting. But I did want to spot-check a few. So I turned on medium thinking in Droid, and medium thinking in Roo Code, for Sonnet 4.5. Honestly, this matches everything I've always said: at least for Sonnet, I don't think you need reasoning to code, at least when you're giving it a spec. When you're building a plan, that's where reasoning works better. If you're fixing a bug, that's where reasoning works better. I didn't find a significant difference, at least in my tests, between no reasoning and reasoning. So I'm calling that out, but I'd love your feedback on it as well. Again, I really love reasoning when I'm planning or dissecting an issue as a pair programmer, less so when I'm using it as a daily coder, which is what this is testing. It's like: I know what I want to do, I have the structure of everything planned out, and I want it to go do exactly what I say. You don't need a lot of reasoning for that.

So it makes sense that my tests don't show crazy improvements here. I also think that's one of the reasons GPT-5 scores slightly lower on my tests: the plan is already in place, and what we're measuring is not so much reasoning capability. I guess to some extent it is, because you still need to implement the logic and everything, but what we're really measuring is its ability to be your daily coder. All right, GPT-5 Codex. Now, this is a model that's actually fascinating to me, because on one hand I think it's phenomenal in Codex, but it kind of sucks in most of the other agents, and I'll show you the overall scores in a second. Surprisingly, Windsurf did pretty well here at 25,850, which I'd still call a very good score. Anything over 24,000 or 24,500 I think is phenomenal, and I think you're going to be happy with it, especially as a daily coder.

Now, Codex on medium with GPT-5 Codex: 26,218. And on the high side, we got a slightly higher score, 26,734. Just phenomenal scores. It wasn't that long ago that 26,000 was just never hit on this tier of the benchmarks; now we've got that consistently. And it's kind of odd, you know, I've got Codex at number one and two, and maybe I shouldn't have done that, so I'll show you what number four is just so you can see what number three could have been. I did want to call out that high in Codex CLI makes a decent difference; not a magnificent, huge difference, but it is super good. Number four here is actually Factory CLI. It does an incredible job with GPT-5 Codex. Like, incredible. In fact, I'd say it's probably one of the better models to use in Factory CLI, which is Droid, by the way; remember, I said I'd probably flip back and forth between those names. Open Code also worked great.

But let's get down to Roo Code. So, Roo Code uses a different tool-calling strategy: it uses prompt-based tool calling. Because of that, at least in my tests, I don't think GPT-5 Codex works that great in Roo Code. Zed, it also did fairly poorly in. Crush, fairly poorly. Cursor and Copilot, I'd say fairly poorly as well. A lot of what I found goes back, again, to how much an agent self-iterates on tool-call failures, or whether it just stops; there are all these different variables to it. But I'd recommend that if you're going to use this model, you really do want to stay in Factory CLI, which is Droid, or the Codex CLI; Open Code is a great one too. Anything below that, I can't really recommend GPT-5 Codex in, at least for now. Maybe over time that'll get improved.
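
To illustrate the difference: with native tool calling, the API hands back a structured tool-call object, while a prompt-based agent has to parse the call out of plain model text. The XML-style format below is only illustrative, not Roo Code's exact grammar.

```python
import re

# With prompt-based tool calling, the tools are described in the
# system prompt and calls are parsed out of ordinary model text,
# something like this:
response_text = """I'll read the file first.
<read_file>
<path>src/app.py</path>
</read_file>"""

def parse_tool_call(text: str) -> tuple[str, str] | None:
    # Pull out the tool name and its path argument, if present.
    match = re.search(r"<(\w+)>\s*<path>(.*?)</path>", text, re.DOTALL)
    return (match.group(1), match.group(2)) if match else None

print(parse_tool_call(response_text))  # ('read_file', 'src/app.py')
```

A model that wasn't trained to emit that exact text format will flub the tags more often, which would explain why a Codex-tuned model underperforms in prompt-based agents.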

Another new model: GLM-4.6. One of my favorite models, and probably the best bang-for-the-buck plan on the market right now, if you don't mind sending your data to z.ai. Especially if you're a student and you're just learning, you should freaking buy that plan, $3 or $6 a month, whatever it ends up being; I think they changed it recently, but it's a really good price. Roo Code has always worked well with GLM-4.5 for me, and it's no different with 4.6: 24,274. In fact, I'd say GLM is arguably one of the better front-end models for me. You do have to tune the temperature some, so play around with it: 0.6, 0.7, somewhere in that range. Where I was able to adjust it, I did adjust that for these tests, but I don't always have that ability in some of the different agents. Crush does amazing with GLM-4.6, which surprised me given some of the other scores I'll show you in a minute, but Crush did great: 25,082.
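
If you want to set that temperature yourself, here's roughly what it looks like against an OpenAI-compatible endpoint like OpenRouter; the base URL and model id are my best guesses at the time of writing, so check your provider's docs.

```python
from openai import OpenAI

# Minimal sketch: point the OpenAI client at an OpenAI-compatible
# provider and pass the tuned temperature per request.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_KEY",
)

response = client.chat.completions.create(
    model="z-ai/glm-4.6",
    temperature=0.65,  # the 0.6-0.7 range suggested above
    messages=[{"role": "user", "content": "Build the login form component."}],
)
print(response.choices[0].message.content)
```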

And the number one is Zed AI, which also surprised me significantly. Again, over-engineering is just built into the nature of Zed AI, so it's hard for me to say I can totally recommend it, but there's a huge jump from number two up to number one. Zed AI definitely loops on itself; it goes to the extreme of looping on itself and doing more than you asked for. So, Zed is number one, and Open Code is over here at the bottom, but honestly, all of these were good. The difference between Open Code and Roo Code is just minimal here. And shoot, my labeling did not work out well, so just to show you: this bottom one here is Roo Code with GLM-4.5. I wanted to give that as a sort of baseline, to show there was a little improvement in Roo Code with GLM-4.6. Not significant, but I think it's more than runtime variance. So keep in mind, this one is GLM-4.5 and this one is GLM-4.6.

Grok Code Fast: one of the most used models on OpenRouter right now, simply because it is free. Cursor actually does a great job with it at 23,270. GitHub Copilot does a phenomenal job at 23,814. And Roo Code surprised me here with the number one spot at 24,270. I know Roo Code did some special things, because I was looking through their code base at one point; I think it was the diff tool, or the edit tool, I'm trying to remember, but I believe they did some custom things specifically for Grok Code Fast to make their tooling work a little better. And if we look here, we can see that Grok Code Fast in Zed was basically unusable. I feel bad about this, but I found Open Code to score relatively low as well. Roo Code, Copilot, Cursor, and Windsurf were all good, so if you're using it in one of those, you're probably getting a fairly good experience. In fact, you can't sneeze at 24,000.

I mean, it's a free model right now. I would be burning those tokens like crazy if I was on a budget, because those free tokens aren't going to last forever, and I think it's important to take advantage of that while we can. Now, overall: I tried to color code these, and hopefully I got them all correct. We've got orange as Sonnet 4.5, and we've got GPT-5 Codex high; the top five or six are basically identical at this point. All right, so taking a look at all of them, Sonnet 4.5 for sure takes the top spot, but not by much; GPT-5 Codex high is right up there with it. And again, this is spec-based instruction following. I wish we had some good benchmarks on planning, or on debugging really deep, complex issues, because my gut tells me GPT-5 Codex is better at that in the Codex CLI.

So, if you've got a spec and you're doing instruction following, using it as a daily coder and telling it what to do, man, you can't go wrong with Sonnet 4.5, and I don't think you can go wrong with GPT-5 Codex high. In fact, in some cases I'd say GLM-4.6 stands firm with them; it's got that top spot in Zed, but it does drop off depending on the agent you're using. The blue is our Grok Code Fast, down at the bottom of course; it's a free, smaller, and therefore faster model, but honestly, it's fine. I've actually debated trying to use it for a few days, because I haven't given it much real effort since it first came out, and I want to see if it's improved some. And then you can see a couple of the other ones I spot-checked in pink, and in kind of a reddish brown, a few DeepSeek V3.1 runs.

Quick note: I actually could not get DeepSeek V3.2 to complete in Roo Code, Copilot, or Open Code, which was surprising, since DeepSeek 3.1 traditionally did complete in those. I think something may have regressed, or maybe it was the provider I was using on OpenRouter; for whatever reason, it wasn't working well for me. Now, real quick on the newcomers: I did want to highlight this just to show you. Factory CLI, which is Droid, has actually done a great job. It came out and it's right in the pack, not extraordinarily better, although I will say the experience of it is pretty good. I like the way it looks; it's a really, really nice CLI. I just go back to my earlier point: I don't know how many of these things we need. So it's hard for me to get excited about new ones that come out looking good but don't really do anything different that I've seen yet. Does that make sense? Anyway, let me move on.

All right. So, Grok Code Fast: lots of promise, but some problems with tool calling, which actually gets annoying. Again, the way I do this is I run a bunch of passes, and if there's an outlier, I throw those tests away, just to remove some of the variance. I had to throw away more of the Grok Code Fast runs than any other model's, because when it goes off the rails and tool-call failures start happening, it just does not recover.
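
Here's one simple way that outlier-throwing could be done; the video doesn't specify the exact rule, so the median-absolute-deviation cutoff below is just an assumption.

```python
import statistics

def drop_outliers(scores: list[float], k: float = 3.0) -> list[float]:
    # Discard any run whose score sits too far from the median,
    # using median absolute deviation as the yardstick.
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores)
    if mad == 0:
        return scores
    return [s for s in scores if abs(s - med) <= k * mad]

runs = [24100, 24350, 23980, 11200, 24220]  # one off-the-rails run
print(drop_outliers(runs))  # the 11200 run gets thrown away
```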

I talked about Zed a little already, and I don't really know what to think of it. On one hand, I like the strategy of just going until it's right, but it does more than I ask it for. I feel like their system prompt is very aggressive, almost too aggressive: "I want to create this documentation, I'm going to create a testing plan, I'm going to do this additional stuff" that I don't really need. I don't know; personally, it's not my jam. I want to like Zed, but I think that's going to have to be up to you, because it did score really well. Warp.dev, on the other hand, or Claude Code: most of the time their issues are more that they do just enough, or maybe stop right before they've fully iterated on themselves. GPT-5 Codex works really well in Codex, and I think it's been tuned for it, because I actually spent some time digging through the Codex code base on GitHub, and the score differences there are just incredible. And if you're in Roo Code, I would just use GPT-5; I personally don't think there's any reason to use the Codex version there. The GPT-5 version simply scores higher than the Codex version in Roo Code and in any sort of prompt-based agent. I'd leave that up to you, and I'd try out both, but I'd say GPT-5 Codex most likely should really only be used in the Codex ecosystem, or maybe in AI coding agents with similar native tool calling.

Hopefully this has given you a little understanding of that. Now, what will I be using in October? This was tough. I sat here for like 15 minutes trying to decide what I was actually going to commit to, and I decided I'm going to give Warp a go. The reason I haven't been using Warp is my workflow: I love working in the IDE with my terminal at the bottom, and I'm worried that's going to be the deal-breaker for me, because when I have to jump out of that window to go to another terminal, it just kind of breaks my flow. I love flowing between the terminal, the extension on the side, and the code. If I could get Warp in my IDE, I would freaking love it. I know you can't right now, or at least the last time I looked you couldn't, but I'm going to give it a real go: I'm going to spend at least a couple of days using Warp purely for all my coding.

I'm paying for a plan anyway, so I might as well give it a go, and I know a lot of you are fans of it. I actually really love it any time I'm doing system admin stuff, since I manage my own infrastructure and all that; it does a really good job helping me understand how to make the different AWS calls I need. Now, Claude Code: I'm going to use this in two ways. I'm already paying for the $100-a-month plan, and I need to give 4.5 more time to understand how to use it effectively. Then I also want to power GLM-4.6 through Claude Code. Right now I mainly use it in Roo Code, and I love it there, but I want to give it a go, because a lot of people talk about the magic of GLM-4.6 in Claude Code. I have tested that in the past and it does work great, but I've never used it for days of actual production work, if that makes sense. Now, Codex CLI: it goes without saying that's honestly my go-to right now.

I feel like nine times out of 10, when Claude Code has a problem with something, I just do it in Codex CLI and boom, I'm good to go. In fact, I ran into that with the issue I mentioned earlier, where Claude Code with 4.5 was ping-ponging back and forth between two wrong solutions and I kept trying to narrow it in. I ended up throwing that code away, went to Codex CLI, gave it the same prompt, and boom, it was good. Again, that's only one data point, just my subjective experience. I do want to give some honorable mentions here. Droid: man, there are just so many of these and I only have so much time, but I'd like to give Droid a little more attention. And there's no way I'd go without mentioning Roo Code and Open Code. Those two are just phenomenal, and long-term, if I had to pick just two open-source agents to use, it would be Roo Code and Open Code, especially if the subscription services weren't around.

I love the configurability of these. I can put in any model; there's so much customizability, so much tinkering I can do with both of them, that it just makes a lot of sense for them to be on my list. And in fact, going back here: most likely, just by the nature of my work, Warp, Open Code, and Roo Code are going to be pretty similar in usage, but I did want to give Warp a little more elevation, because I haven't really committed to it the way I should. Anyway, that's going to wrap it up. This video has gone on a lot, and it has been a lot of work to put together. I know some of you are going to be disappointed that I left off some model or some agent, and I'm sorry, but this stuff just takes so long to test. I'd almost need to spend an entire week leading up to this to get more in, and I'd have to give up a lot of sleep. Anyway, I appreciate you all. Hopefully this is helpful in some way, and if it happens to be, hitting the subscribe and like buttons would mean everything to me. Till next time everyone, peace out.