Local LLMs for coding have gotten insanely good. Devstral Small 2507, OpenAI's GPT OSS 20B, and Qwen 3 Coder 30B are put head to head using some new tools, all running on my RTX 5090.
Links:
🧑💻My Recommended AI Engineer course is Scrimba:
https://scrimba.com/the-ai-engineer-path-c02v?via=GosuCoder
My Links 🔗
👉🏻 Subscribe: https://www.youtube.com/@GosuCoder
👉🏻 Twitter/X: https://x.com/GosuCoder
👉🏻 LinkedIn: https://www.linkedin.com/in/adamwilliamlarson/
👉🏻 Discord: https://discord.gg/YGS4A
There has been so much going on in the AI coding space that I haven't had time to really deep dive Qwen 3 Coder 30B A3B, which is a mixture-of-experts model that is honestly freaking killer. Now I just want to start by playing a little bit of a game. First thing I'm going to do is show a couple examples of some UI. This one's pretty ugly, not going to lie. The prompt here is a personal portfolio for Gosu Coder, blah blah blah. We have this one, which starts to look a little bit better. It talks about some of the games I've worked on. And we have this one, which also looks awful. And then we also have this one. Let me show you a refresh here. So, it's got a nice little animation here. Apparently, that's me. But regardless, there are some overlap issues and things like this. Now, all of these were done with local models, but in my opinion, it's pretty easy to tell that this is by far the best one by a significant margin. So,
this was done by Qwen 3 Coder Q5. This one was done with the new GPT OSS 20B model with a temperature of 0.6. This one was done with that same model but a temperature of 0.9. And this one was done with Devstral Small 2507 with a temperature of 0.15. When I first started using Qwen 3 Coder 30B, I actually started with the Q4_0 model and it was awful. I'm talking I almost gave up on it. But then I went up to the Q5_0 model and it was better, but still not great. So, I started researching a little bit more about tool calling reliability, and I did a pretty deep dive, probably over a period of four hours, testing a bunch of stuff, and I'll show you one of the tools that I actually built to test some of this reliability. But regardless, what I think I've landed on is that the _0 quants are the worst. K_S is slightly better than that, K_M is slightly better than K_S, and K_XL is the best when you have to choose. So I would recommend, and this is based on what I found in testing and some research that I've done, that you pick your quantization level and go with the XL version. It's a newer style of quantization. Back when I did a deep dive into this in the past, I really focused on K_M because I found it to be the most reliable, but I don't think that's the case anymore. I do plan to do a deeper analysis on this, but I am sharing some of my findings here. The three models we do want to talk about are Devstral Small 2507, where I also matched the Q5_K_XL quant, and since the GPT OSS 20B model is smaller, I actually went up to Q8_K_XL for it. All three of these were done by Unsloth. So if you're using LM Studio, you can find them by searching for
Unsloth and the model name. To me, Unsloth has been the most reliable, in particular at tool calling. So, my setup here: I'm running an RTX 5090. The model I want to talk about the most is the Qwen 3 Coder 30B A3B, but I actually have testing across all three that I think is very important to put this thing into perspective. Now, what I have found is that I can typically load about 100K context into LM Studio. One of the problems I run into, though, is that there are other applications running. Right now, I have OBS running, and I have other things that use a little bit of my video graphics memory, my VRAM. I have 32 GB on the RTX 5090, but typically about 3 to 4 GB of that is consumed by something else on my computer. If I reboot, I can actually raise this up to about 110K or 115K and not have it spill over into system memory. Just something you have to monitor. So, for example, one of the things I do is I load it up and I see
how close we get to that limit. And then I actually do a test to make sure that when I run something, it's not spilling into system memory at all. The couple of things I'll call out that I think are incredibly important are really over here: flash attention, and then the K cache and V cache quantization types. You have to enable flash attention to use those. Now, not every model actually handles this very well. For example, the GPT OSS model doesn't seem to work with it. It does load less into video memory, but it doesn't seem to actually help when it comes to performance. So, what I end up with is everything's in VRAM, but the speed is atrocious when I turn this on. If one of you knows the right settings for that, let me know in the comments below. But regardless: flash attention with K cache and V cache quantization at Q8 makes such a massive difference in the amount of video memory you use, for a very small amount of quality loss. And on
top of that, the models we're going to talk about today, specifically Devstral and Qwen 3 Coder, are so good that you really can't tell a difference there. Now, one thing to note is that speed is very important to me. Honestly, it's one of the reasons I get frustrated working with GPT-5 on medium or high reasoning. So, I wrote a TPS tester. It's a very simple CLI tool, and you can kind of see how I run it here. I pass in an API endpoint and a key; for a local model, I don't actually need a key, so I can put whatever the crap I want in there. These are some of the prompts. I don't actually show the entire prompt, but you can get the idea. The goal is to generate as massive an amount of tokens as you can from a single prompt and then measure it. Now, this is Qwen 3 Coder 30B with all the settings we set above. And you can see over here, when I'm streaming, I end up with 177.42 tokens per second, with a low of 123 and a high of 210. Now, another thing to pay attention to is this TTFT.
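To make the two numbers concrete, here's a minimal sketch of how a tester like this can compute TPS both with and without the time-to-first-token (TTFT) from streamed-token timestamps. This is my own illustration, not the actual tool; the function name and argument shapes are assumptions:

```python
# Minimal sketch of the two TPS variants a streaming tester can report:
# one including time-to-first-token (TTFT) and one excluding it.

def stream_stats(request_start, token_times):
    """Given the request start time and the arrival timestamp of each
    streamed token (all in seconds), return (ttft, tps_excl, tps_incl)."""
    first, last = token_times[0], token_times[-1]
    ttft = first - request_start                 # time to first token
    n = len(token_times)                         # tokens received
    # Excluding TTFT: tokens after the first, over time since the first token.
    tps_excl = (n - 1) / (last - first) if last > first else 0.0
    # Including TTFT: all tokens over total wall time.
    tps_incl = n / (last - request_start)
    return ttft, tps_excl, tps_incl
```

Against a real endpoint, the timestamps would come from iterating a streaming chat-completions response and recording `time.monotonic()` per chunk.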
That is the time it takes for me to get the first token back from my GPU, and these are incredibly low, about a tenth of a second. Total average 177.88, the min is 124, the max is 210.8. Just absolutely amazing. The reason these are a little bit different, just to be very clear, is that this one here calculates it including the TTFT, whereas the ones at the bottom calculate it without it. So that's just the speed after you get the first token; that's why there's a little bit of difference there. Now, if you're not streaming, I got an overall average of 155. You can see the total number of tokens generated, 37,668, and so on. Now, for the longest time, Devstral was my go-to local model. That has now changed: it's Qwen 3 Coder, by far. And I'll show you a little bit more about why that is, but check this speed out. Now, same thing, all the settings above. This is very standard for me
around, you know, 79 to 80 tokens per second. That's still really good, but when you run these side by side, it's incredibly different. The one thing I'll note here is, look at the number of total tokens generated: 10,529 compared to 40,376, and a total time of 229 versus 132. Devstral Small just does less here. And you can kind of tell that when you're actually coding with it, too. Now, in the non-streaming test, we end up with 14,890 tokens and an average TPS of 71.47. And what is that, a third of the tokens? Not quite: 37,668 versus 14,890 tokens generated. So, significantly fewer tokens. Now, let's talk about the GPT OSS 20B model. This is a model I kind of feel bad about, because I didn't give it much credit, but mainly it just did not perform really well and I couldn't figure out a good temperature
for it. But I kind of tuned it in a little bit and I think I got it working pretty well. Overall, though, speed-wise it's pretty fast: a 143.36 average. You can see the min at 70, but it also generated a lot fewer tokens, 11,89, compared to Qwen 3 Coder, which was like three times that. So, we can kind of see the rest of the numbers here. The big one to call out is this max TTFT. The variance here is a lot different: 4.2 seconds at the high, 0.5 at the low. When you look at all the other ones we were looking at, they're all about a tenth of a second; this one's slightly higher, at 0.15-ish. Now, in the non-streaming case, we have a combined overall TPS of 179, which is right on par with Qwen 3 Coder. Those are neck and neck. Total tokens: 16,921. So, a lot fewer tokens generated from the same prompts. Now, the one thing I'm still trying to figure out is why these generate so many fewer tokens. I probably have to play with the prompts
more. This is an early tool I'm building to really be able to test TPS, and ideally I could get them all to consistently generate about the same number of tokens, but models just do different things. So, I want to talk a little bit about Roo Code and OpenCode. Now, this is SST's OpenCode. And I think it's important to understand that these two are very different in the way they actually run tool calls. Roo Code has prompt-based tool calls. What that means is the AI will actually give you text, and within that text there will be some definition of what Roo Code needs to go handle. This isn't a native function call; this is something that is parsed out of the text from the LLM and handled by Roo Code. So it's important that it formats it very accurately. OpenCode, on the other hand, actually uses function calls. You can see that very simply by looking in the source code, but you can also see it in the LM Studio logs, where you can see how they actually pass in the different tool calls. What's really great about testing both of these is that we get to see how a model works with prompt-based tool calls and how it works with native tool calls. Now, Qwen 3 Coder: when I'm actually running it, I typically range between 0.6 and 0.7 temperature. I found both to be good. I think 0.7 is where I've landed for most of the things I'm doing now; I think that has just worked the best, but 0.6 typically works fairly well too. I'll see some people set this to 0.55; I think that might be a little too low. I would recommend starting at 0.7 and tuning from there. I've also tuned the top K, and you can see the repeat penalty a little bit, and the top P sampling. These are just settings I have tested and tweaked to get the best performance. But remember, each of these tools is distinct in how it does tool calls, and that's so important as I get into these next tests, because I've learned so much. Usually I start in Roo Code and see if it works, and that's what I did with the GPT OSS models, and they didn't work, so I kind of gave up on them. But let me show you something. I've written a tool that I've been running for a little bit now. I haven't actually shown this to anyone yet, so this is the first time revealing anything about it. I'm going to talk through some of the numbers here, but really, what I want to do is do hundreds, maybe eventually thousands, of tool calls, and do it in chains: basically try to get 30, 40, 50 messages deep and measure how accurate the models are at making tool calls. I started with simple structural parameter accuracy, which means when the model makes a tool call (this is native tool call testing, not prompt-based tool call testing), is it actually passing in what it should? Is the JSON formatted properly? Are the parameters what they should be? And for
Devstral Small 2507, it's 83.9%. Overall, what I found is that Devstral works really well in early chains. So, for example, if I'm sending a message and then the AI needs to make a tool call or two, it's great. It's really great on the smaller structural pieces, which is why the weighted complexity score is so low here, because the higher the complexity of the test, the more it counts toward the end result. But I don't think I'm calculating this well enough yet, and you'll see why in a bit. So anyway, the main numbers I want you to pay attention to are the 42.5% and the 83.9%. The other ones are interesting, but I haven't tuned them properly yet. To show you something else, too: LM Studio only supports the legacy functions parameter with function_call and the tools parameter with tool_choice; it doesn't actually support the modern tool_choice object form. Important to note, because some of these AI coding tools may only
work well with the modern one, but that hasn't seemed to impact me yet. Qwen 3 Coder is tool call happy. This is an example of a medium test running, which expects about six to ten tool calls. The way I'm checking this is: here's the chat I'm running, with a bunch of back and forth, and here are the tool calls that should actually happen. We ended up getting 24 here, and it should have been between 6 and 10. So it is tool call happy, which causes it to score lower overall, because on the smaller, simpler tests it calls more tools than I expect. Again, I'm still playing around with this, but check out the structural accuracy: 100%. And 53.9%. When the tool calls are expected to be in the 30s and 40s, it does a great job; it does a better job than Devstral 2507. Still really interesting data. Again, I find this
very useful. I'll be curious whether you guys actually like stuff like this; let me know in the comments below. I think if I could figure out a way to automate this on a grand scale, to run it across a ton of different models and automate the temperature settings and everything and figure out the structural or overall score, this could be pretty sweet, honestly. Now, GPT OSS 20B: this one is very good as well, 99.4%, near 100, but its semantic score is 17.7%. Now, semantic is a little bit different. You can see it's slightly higher than the Qwen 3 Coder one, but it's going to be lower than Devstral Small. Devstral Small calls the number of tools you expect until you get to the higher counts. What ends up happening here is this one is also a little bit tool call happy, and then not tool call happy enough in the longer conversations. So, here's an example of a scenario where I expected 11 tool calls as part of this conversation turn and I got 15. So, you know, you kind of
weight it that way and try to figure out what the overall score is. I'm still figuring out all that stuff, but I still think it's fascinating. 99.4% is great, and 53.9% is great. It got me thinking: why does GPT OSS work so poorly in Roo Code? I mean, it's obvious, right? It's because it's not good at prompt-based tool calling. So, there are really two measures here. No matter what temperature I tried with GPT OSS, I could not get it to work in Roo Code. It always looked like this. I even played around with prompt strings and such. In OpenCode, you do still get errors at lower temperatures, but in OpenCode it actually completes. And I think that is the critical thing, because it's very usable, and I would say kind of solid, in OpenCode for a model you can run incredibly fast on your local machine. Considering that not that long ago, the best thing
that I had was Devstral 2507. And actually, 2507 is an iteration; it was 2505 before. But before I go into the evals, the one thing I do want to talk a little bit about is the prompt template override in LM Studio. It's tricky, because Devstral and Qwen 3 Coder did not work in OpenCode until I overrode the template in LM Studio. I have a Bitly link if you want to grab it; it points to a Drive link that I'm sharing. I cannot promise you I'm an expert template writer. I used AI to help me figure out the right way to configure it, but once I did, I got it working incredibly well. Otherwise, I was getting errors like this: "Error rendering prompt with Jinja template: unknown string value filter safe." So, important to note that I did override that. And the way you do that in LM Studio, which is also kind of critical to know, is to go into My Models and click the cog wheel here
on the model that you want to change; there you can basically override the prompt template. So, let me go ahead and load up my Devstral Small config, where I actually did override it. You can see here, this is the place where I pasted that template. You put that in there and save, and then it just resolved the errors perfectly. Now, on the eval side, I went all over a bunch of stuff. These are going to blow your mind, because they actually blow my mind. Let's start with Devstral. Devstral Small 2507: 16,754. The one thing to keep in mind here is that what we aren't measuring is its overall knowledge. What we're telling it is: do this particular set of work and make it function. I tell it the library versions it needs to use, the class names, the function names. So, I'm really measuring its ability to follow instructions and its tool calling ability. I'm not measuring its overall knowledge of any programming language in the world. I do want to do that at some point, because I think it would be fascinating, especially now
that we're starting to converge at this 26,000 mark; I think it'd be important to layer on other things that we're testing. But Devstral Small: 16,754, and 17,888 in OpenCode. So, OpenCode did better with Devstral Small. Remember, these are two different ways of doing prompting. Now, I did put a couple of other things in here. Here's Gemini 2.5 Pro at 0.7: 19,516. Isn't that crazy? We have local models that are following instructions and tool calls that well. Now, here's GPT OSS 20B at a temperature of 0.6. I have actually tested it at 0.9, and the results are pretty dang similar; you just get a little bit better design, in my opinion. But 20,210? That is insane. Absolutely insane. Now, here I ran the test at a temperature of 0.6. You can do 0.7; it doesn't matter, whatever works for you. But for these I did it at 0.6, and you can see here a score of 22,178 and 22,454.
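As a quick aside, the Roo Code versus OpenCode difference from earlier (prompt-based tool calls parsed out of model text, versus native function calls returned as structured JSON) can be sketched like this. The XML-style convention, tool names, and message shape here are my own illustrative assumptions, not either tool's actual format:

```python
import json
import re

def parse_prompt_tool_call(text):
    """Prompt-based style: the tool call arrives as plain text that the
    client must parse. The <tool><param>value</param></tool> convention
    here is purely illustrative."""
    m = re.search(r"<(\w+)>(.*)</\1>", text, re.DOTALL)
    if not m:
        return None
    name, inner = m.group(1), m.group(2)
    params = dict(re.findall(r"<(\w+)>(.*?)</\1>", inner, re.DOTALL))
    return name, params

def parse_native_tool_call(message):
    """Native style: the API response carries a structured tool_calls
    field, so there is no text parsing, just decoding the arguments."""
    call = message["tool_calls"][0]["function"]
    return call["name"], json.loads(call["arguments"])
```

In the prompt-based case, any formatting slip by the model breaks the parse, which lines up with GPT OSS completing in OpenCode but failing in Roo Code.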
When I saw these scores, it blew my mind. It literally made me go crazy, because we are in the range of what I would say was probably Claude 3.5 back when it came out. It's that good. The challenge is going to be the amount of knowledge that fits in this 30B brain. So I was thinking, if I could figure out a way to get my context window up higher, and get the documentation for my project somewhere searchable or findable, mainly the libraries I'm using and all of that stuff, I could literally see myself doing a challenge where I just code with Qwen 3 Coder 30B for an entire week, and I feel like it would be great. Maybe I'm overreacting here, but I feel like it could be good. Also note, I was only able to do these at a 100K context window, just based on my GPU. I really would want this closer to 150K or 160K, ideally, if I
wanted to use this for real code. If we look at Qwen 3 Coder on Cerebras at a temperature of 0.6, look at that. This is insane to me: we're nearing the big version of Qwen 3 Coder running on Cerebras, which scored 25,898 in Roo Code. That is insane. I'm just blown away by this. You know, the number isn't everything. The context window is a lot lower here, the knowledge is a lot lower here, but it follows instructions so well. That doesn't mean anything unless you actually use it in a real codebase, though. So, let me show you something I did. I took a file, because one of the things I like to test is whether this is something I could actually use, and it's freaking awesome. I said, analyze this file and give me three examples of things to change. It gave me three great examples; all three were reasonable. And I said, "Good. Go ahead and do number one," which
is basically extracting some hard-coded values. I wanted to do that anyway; I thought it was great. And it did it perfectly. Look at this: no tool call failures. And it's fast, because we're getting 150 to 170 to 180, sometimes more, tokens per second. And it didn't reformat my entire page; it just changed what it needed to change. It's great. Honestly, I want to go spend Monday just running through using that model. But I need to tune it to figure out a way to get more than that 100K context window, because if it spills over into system memory, the tokens per second literally go down to about two. OK, I'm being a little dramatic there, but it is so slow when that happens. I'll just close out here with a couple of different demos. So, this is a first-person shooter that Qwen 3 Coder made, and it worked pretty dang well. The camera gets a little janky, and I can't actually shoot; I don't know if I can kill anything. But I thought it was pretty good for it to
one-shot this, and actually get 3D graphics working, with a local model. I've never been able to do that before. So, this is an interactive music visualizer I had it one-shot for me. Hopefully this doesn't mess up my recording in OBS at all, but regardless, I thought it was kind of neat. It does have some issues with this dropdown here, where it's white, but in general, I think it works pretty dang well. It seems to have trouble with different colors and contrast and stuff like that, but this is all done locally. It's so freaking good. I'm just amazed with this thing. I can't tell you how excited I got when I started running through this and doing real code work in my actual codebase, while at the same time it scored so well on the eval. Throw in some of these additional testers that I'm building, and I can actually prove that it is doing an incredible job tool calling.
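The "structural parameter accuracy" check from the tool call tester can be sketched roughly like this. This is a minimal illustration, not the actual tool; the tool spec and required-parameter sets are hypothetical:

```python
import json

# Hypothetical spec: tool name -> set of required parameter names.
TOOL_SPECS = {
    "read_file": {"path"},
    "write_file": {"path", "content"},
}

def check_call(name, raw_args):
    """True if a tool call is structurally valid: known tool, arguments
    are well-formed JSON, and all required params are present."""
    if name not in TOOL_SPECS:
        return False
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError:
        return False
    return isinstance(args, dict) and TOOL_SPECS[name] <= set(args)

def structural_accuracy(calls):
    """Fraction of (name, raw_json_args) tool calls that pass check_call."""
    if not calls:
        return 0.0
    return sum(check_call(n, a) for n, a in calls) / len(calls)
```

A chained test would then just aggregate this over every turn of a long conversation, alongside a separate count of how many calls were made versus expected.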
I'm blown away. Anyway, I'm going to stop there; I've gone on long enough. Let me know in the comments below if you've had a chance to try this model, or if there are settings, configs, or other local models I should try. I don't know if there's anything on the market right now that can compete with Qwen 3 Coder 30B and still fit on my card, but I'm happy to try it. Anyway, everyone, sorry for getting a little hyped up on this one, but until next time, you guys take care. Peace out.