Red Stapler
Best Local Coding AI for 8GB VRAM (2026 Benchmark)
Channel: Red Stapler
Date: 2026-01-28
Duration: 9min
Views: 70,316
URL: https://www.youtube.com/watch?v=m3PQd11aI_c

Is 8GB of VRAM enough for local AI coding in 2026? Running capable coding agents on consumer hardware is a challenge, but with the right model and RAM offloading, it is possible.

In this video, I benchmark 5 promising local LLMs on an RTX 4060 8GB to find the sweet spot between speed and coding capability. I test them on real-world web development tasks using Aider and VS Code, from refactoring Next.js projects to full UI redesigns.
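The test project used throughout the video is a simple Next.js app that saves the user's input text to the browser's localStorage. As a rough sketch of that kind of logic (not the video's actual code; the names are hypothetical, and the storage object is injected so the snippet also runs outside a browser):

```javascript
// Minimal sketch of the benchmark app's core logic (hypothetical names).
// `storage` is injected so this runs in Node as well as in a browser,
// where you would pass `window.localStorage`.
const KEY = "saved-items";

function loadItems(storage) {
  const raw = storage.getItem(KEY);
  return raw ? JSON.parse(raw) : [];
}

function saveItem(storage, text) {
  const items = loadItems(storage);
  items.push(text);
  storage.setItem(KEY, JSON.stringify(items));
  return items;
}
```

In the actual Next.js project this would sit in a client component, with the list rendered from `loadItems(window.localStorage)`.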

Models Tested:

Nemotron 3 Nano 30B A3B

GLM 4.7 Flash

OpenAI GPT-OSS 20B

Qwen3 Coder 30B A3B Instruct

Devstral 2 Small 24B

As you already know, an 8GB GPU is quite small for running local coding AI. Most of the models that fit entirely in that much VRAM are, no offense to their creators, not helpful enough for agentic tasks or project-scale coding. If you want a relatively smart model, you have no choice but to pick a bigger one, sacrifice speed, and offload part of it to system RAM. So in this video, I've tested several AI models locally, and I'm going to share the results with you. Let's check it out.

My criteria when picking models for coding on an 8GB GPU are: first, they must be smart enough to handle small to moderately complex tasks, such as refactoring code at the project level or changing UI functionality from just a prompt. Second, since we're offloading to RAM, they're going to be slow, but they must at least stay usable, around 5 to 10 tokens per second. Lastly, they must have a high tool-calling success rate. You don't want to wait five minutes after a prompt only to find out it failed.

Based on these criteria, I'm picking models around 20 to 30 billion parameters with 4-bit quantization, the smallest size that's still capable of handling agentic tasks while fitting on small hardware. Here is the list of contenders. The first one is Nemotron 3 Nano 30B A3B, which I've heard lots of good feedback about for speed and memory efficiency, especially on small GPUs, though with some complaints about code quality. Next, fresh from the oven, is the latest release from Z.ai, GLM 4.7 Flash. I've heard mixed reviews on this one, both very good and not so good, but it's worth trying. Next is the famous OpenAI GPT-OSS 20B, the smallest model on the list, yet it achieves a high SWE-bench Verified score. Next is Qwen3 Coder 30B A3B Instruct, released last year and probably the oldest model on the list, but it gets lots of positive feedback on both speed and quality. And the last one is Devstral 2 Small 24B. This is the only dense model on the list, and it's going to be very slow since we can't fit it into VRAM.
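To see why offloading is unavoidable here, a back-of-envelope estimate of quantized model size helps (a rough approximation only; real GGUF file sizes vary with the quant format and metadata overhead):

```javascript
// Rough weight-only size estimate for a quantized model.
// ~4.5 bits per weight approximates a typical 4-bit quant once
// per-block scales and overhead are included (an assumption, not exact).
function quantSizeGB(numParams, bitsPerWeight) {
  return (numParams * bitsPerWeight) / 8 / 1e9;
}

// A 30B model at ~4.5 bits per weight needs roughly 17 GB just for
// weights, far beyond 8 GB of VRAM, so a large share of the layers
// must be offloaded to system RAM.
const modelGB = quantSizeGB(30e9, 4.5);            // ≈ 16.9 GB
const vramGB = 8;
const offloadedGB = Math.max(0, modelGB - vramGB); // ≈ 8.9 GB into RAM
```

This is why the 24B dense Devstral is expected to be the slowest of the group: every token touches all of its weights, including the offloaded ones, whereas the A3B mixture models only activate a small fraction per token.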

However, it has really impressive benchmark scores. It beats Gemini 2.5 Pro and is almost on the same level as Claude Sonnet 4, so I'll give it a shot.

For the test setup, everything is running on my mini PC: an Nvidia RTX 4060 8GB with 32GB of system RAM. For the local AI server, I'm using LM Studio with each model's recommended temperature and sampler settings. For the agent, since we're running on very tight hardware resources, I'll use Aider in VS Code. From my experience, it's the fastest agent and doesn't bloat the prompt when talking to the model. Running a heavily offloaded model on more demanding agents such as Claude Code, Gemini CLI, or Roo Code is going to be slow to the point of being unusable. Trust me, I tried.

Now for the test subject: I have a simple Next.js website project that saves the user's input text to the browser's localStorage. The first test prompt asks each model to refactor the project's code to use Tailwind CSS. And here are the results.

Nemotron surprised me with its output speed; despite being a 30-billion-parameter model, it's super fast. However, its thought process is very chatty, so the prompt ended up taking 5 minutes despite the fast output, and the code result was even more disappointing. I couldn't start the web server, and when I checked the changes, I found that Nemotron had deleted my whole layout file.

As for GLM 4.7 Flash, since it's the biggest model on our list, it's slow and barely usable. But the real issue was its output formatting and tool calling, which Aider couldn't understand, so it eventually got stuck in a loop. At first, I thought it was an agent compatibility issue, so I tried switching to Claude Code and Gemini CLI. That was better, with still the occasional failed tool call, but the speed dropped to the point of being unusable. I spent hours trying different quants and parameter settings. Unfortunately, this model is too much for an 8GB GPU to handle, so I decided to drop GLM from the test.

GPT-OSS 20B's output speed is okay for its size. However, the prompt took seven minutes to finish, and the model made some CSS import errors. After I fixed those, I found it had also changed the background color of the list items. I had actually intended to use that as the next test prompt, but the model noticed the background color made the text difficult to read and changed it even though I hadn't asked yet.

Our old friend Qwen3 Coder was the quickest to complete the first test, taking only 3 minutes. The result is good: the UI stayed almost the same after refactoring, except the model adjusted the width of the input field to match the list items.

Devstral actually surprised me. Despite being a dense model that's heavily offloaded, it's still usable at around 3 to 5 tokens per second, and it completed the job within 3 minutes too. The result is perfect: the UI stayed exactly the same, with no unnecessary changes I didn't ask for. And here is the summary of the first test.

The next test is a simple yet slightly vague prompt to change a color: "The text on the list is difficult to read. Revise the list color scheme." A good model should understand the project structure and finish the prompt quickly without unnecessary changes. Qwen was the fastest to finish the task, at around 2 minutes. However, it also added extra bottom padding to improve readability, which we did not ask for. Devstral came in second at 2.5 minutes, and the result was great, with no unnecessary changes. GPT-OSS was a bit slower at around 4 minutes, with a great result. Nemotron, despite having the highest output speed, took 9 minutes because of its internal monologue, but at least the result was great. And here is the summary of the second test.

The third test is a whole UI redesign: change the list items into draggable items on a kanban-style dashboard with a hacker terminal theme. We'll see how each model interprets the design and how capable it is at front-end coding. Devstral took around nine minutes to finish the task; however, the UI had poor alignment and missing functionality. It ignored the draggable feature that I explicitly specified in the prompt, so I listed the mistakes and told it to fix them, which took another 6 minutes. And here is the final result. GPT-OSS took 9 minutes to complete without any issues; the result was perfect. Qwen3 surprised me with its speed: the whole prompt took only 3 minutes, and the result is probably the best one, in my opinion. As for Nemotron, the whole prompt took around 10 minutes. At first it seemed okay, but as the context started to grow, the model hallucinated and finally decided to delete all the UI code, so I dropped it from further tests. And here is the summary of the third test.

For the final test, I asked the models to add a gradient color animation with a breathing effect to the title text. Here are the results. GPT-OSS ignored the gradient animation and created only the breathing effect; the prompt took 6 minutes. Devstral completely failed the test: it hallucinated and destroyed the CSS files. I told it to fix them, but that resulted in another failure. Qwen3 was the only one to complete the task perfectly, in around 2 minutes. And here is the summary of the final test.

So, based on my tests, the one that managed to complete every task at a relatively fast speed is Qwen3 Coder 30B. I usually run it on a 16GB GPU, so this is the first time I've tested it on 8GB, and it surprised me how usable it is. For me, this is the best AI model for an 8GB GPU in terms of code quality, front-end design, and speed. However, as a 30-billion-parameter-class model, it will still struggle with demanding agents such as Roo Code, Claude Code, or Gemini CLI. You'll experience failed tool calls and significant drops in speed and quality. For example, I used Qwen3 with Claude Code for test 3, and the prompt was still processing after 30 minutes, versus just 3 minutes in Aider. So you'll need to be very careful when selecting an agent for a small GPU.

GPT-OSS 20B is also a good model. It made some small mistakes, but overall it's still great. However, it's a lot slower than Qwen3, and in real-world usage you'll need to iterate through prompts a lot, so speed matters too. Devstral: I like it, but it's weak on the front end and occasionally hallucinates. It's a capable model if you stick to pure logic code changes. As for Nemotron 3 Nano, I don't think it's usable for agentic tasks. It's like driving a very fast car blindfolded; that's all I can say.

On a final note, this is strictly an 8GB GPU test, so don't expect these models to vibe-code your whole app or one-shot your prompt. You must break requirements down into small tasks, and you'll need to clear the context and chat history very often to keep the model fast and prevent hallucination.

And that's all for this video. If you liked it, don't forget to subscribe for more AI and dev tutorials. Thanks for watching. See you next episode.