Zen van Riel
The Ultimate Local AI Coding Guide For 2026
Channel: Zen van Riel
Date: 2025-10-21
Duration: 36min
Views: 235,896
URL: https://www.youtube.com/watch?v=rp5EwOogWEw

🎁 Get my FREE local AI projects: https://zenvanriel.com/open-source

⚡ Master AI and become a high-paid AI Engineer: https://aiengineer.community/join

Video references:

- LM Studio: https://lmstudio.ai/

- Continue (VS Code Extension): https://docs.continue.dev/

- Kilo Code: https://kilocode.ai/

- Claude Code Router: https://github.com/musistudio/claude-code-router

- Demo Auction Repository: https://github.com/AI-Engineer-Skool/ai-coding-auction-demo

The future of AI coding is local - no cloud dependencies.

I run a local AI coding environment on my own hardware. And today, you can learn this skill that will make you stand out from 90% of AI engineers who are completely dependent on cloud APIs. In this master class, I'm going to show you how to set up your own local AI coding assistant from scratch. Now, the first thing we need to do is talk about hardware. I'm running an RTX 5090 GPU right now, and yes, that's definitely top-of-the-line consumer hardware, but here's the thing. You absolutely do not need this kind of beefy GPU to get started with local AI coding. I'm going to teach you the fundamentals you need to understand to find the best local AI model that actually works with your GPU, whether that's a budget gaming card, an older data center GPU, or even a MacBook with unified memory. And I'll show you how to install these models in different AI code agents like Kilo Code, Continue, and even Claude Code because yes, that's actually possible. So you don't need fancy enterprise hardware. But what you do need to understand is how VRAM works,

how to select models that fit your constraints, and lastly, how to actually make this work for real coding projects, not just toy examples. And that's exactly what I'm going to teach you today. So let's get started. And you can see right now that I'm on my Windows PC. Now, if you are on Linux or Mac, there's no need to worry because this master class is compatible with all operating systems. I'm going to give you the principles and the fundamental knowledge that you need to be able to run language models on any operating system and use them for AI coding. In my case, I'm using an RTX 5090 GPU, and that's definitely at the top end of consumer-grade hardware. But you don't need this kind of beefy GPU to get started with local AI coding. But first, we have to go through some fundamentals. Specifically, how do you actually install a local AI model for coding and how do you get it to run? Well, here's the thing. I'm using LM Studio for that. And LM Studio is a great place to get started because it offers a simple user interface to download and test models in this chat format before you actually,

you know, let it loose on your code repositories. Now, you can use things like Ollama in more advanced use cases, which are more terminal based, but again, I want this to be accessible to everyone. So, in this video, we'll be making use of LM Studio. The great part is that whether you are using LM Studio or Ollama, all of these tools are very general purpose because they generally expose the language models using the OpenAI API format, which means that other implementations that expect this kind of API can plug into your AI models with no problems at all. And this is why these kinds of models are actually going to be fully compatible with Kilo Code, Continue, or whatever new open-source AI coding agent is popular at the moment you're watching this video. So how do you actually select a language model that fits your GPU? Well, when you first install LM Studio, there will be some recommendations, but I want to give you the fundamental knowledge to understand what actually happens to your GPU once you load a language model into memory. So let's have a look at the models that I have actually downloaded

here. In my models, you can see that I've downloaded Qwen 2.5, which at the moment is a very good open-source coding model. It has 32 billion parameters and it's 21 GB in size. I also have the latest open-weight GPT model from OpenAI, which has 20 billion parameters and is thus much smaller in size. And both of these models are quantized models. So without getting too deep into the mathematics, quantization basically reduces the size of the model while keeping it nearly as accurate and performant. So in this case, what we actually have are smaller, downscaled versions of these two models. And that is essential, because the main limitation for being able to run local AI is generally the size of the models. Because here's the unfortunate truth. To run these local AI models, you need to load the entire model into your GPU's memory. This is generally not the memory that's on your computer, the regular RAM. This is the VRAM, the video RAM that's available to your GPU

as dedicated memory. So, if we check out the task manager here on the left, you can see that I actually have, let's round it off to 32, 32 GB of dedicated GPU memory. So basically, for me it is very important that I select a model that fits within 32 GB. But you will see later on that just because this model is 21 GB doesn't mean that I can actually run it when I use it for AI coding use cases, because AI coding is a little bit special. To do AI coding properly, you often have to load a lot of context into your model, like, you know, a bunch of files in your repository, for your code agent to actually do something useful. So you will find that you actually need quite a lot of VRAM to run these kinds of models. What's important to know here is that VRAM is very expensive. If you try to buy an Nvidia GPU, you'll find that not many of them have 32 GB of dedicated memory. If you're looking at a great budget option, you can either go ahead and find some older chips with more VRAM, or you can actually get a MacBook or Mac Mini

because these actually have shared RAM that's used by the entire system. For example, if you purchase a MacBook Pro with an M4 Pro chip and 48 GB of RAM, this is actually an exception to the rule, because you will have about 48 GB available as VRAM with that chip. And that's the beautiful part about Apple's M-series architecture: you actually get that shared memory allocated between all of those different components in your system. Whereas with a regular GPU, like an Nvidia card, you know, it's all dedicated GPU memory. So this is something to keep in mind. You will have to figure out what kind of machine you want to build for your local AI use case, but it is actually possible to run local AI on a budget if you purchase something like a Mac device. Now, that's quite a contradiction, right? Usually it's not like that, and MacBooks are actually quite expensive. But the amount of, you know, dedicated GPU memory is not the only aspect that determines whether

or not you can load the AI system fully into memory. Even if it fits, that doesn't mean that the AI system is going to run effectively, because you could buy a GPU from 10 years ago that might have quite a lot of dedicated memory, maybe, you know, an old data center GPU, but it won't actually have the right cores or the right speed to invoke the LLM at a decent pace, and that's another consideration to make. Right? So enough theoretical talking. Let's go ahead and load one of these models into my GPU. I'm going to start with a smaller model here, the 20-billion-parameter model from OpenAI, because as I'm recording this, you know, I'm already using quite a lot of video encoding and dedicated GPU memory. So I just kind of want to start off with a smaller model here. So what I can do is I can go to my chat and I can go ahead and select this 20-billion-parameter model. And I'm actually ticking this box here that says manually choose model load parameters so I can talk a little bit about the main parameters that you might need to set. There are a lot of parameters that you can set. And if you check out the advanced settings, you can even see

that there are some advanced mechanisms like KV cache quantization as well as flash attention that you can use to optimize your local AI. But in this master class, I want you to master the fundamentals. So the main thing that we're going to be talking about here is the context length. The context length determines the total amount of tokens that you can actually interact with in a given conversation. And the difficult part here is that for real AI coding use cases, you know, beyond a standard to-do app, you need quite a lot of context. And this is what a lot of these YouTube videos don't tell you. They show you, oh, I'm running my local AI model on my codebase, but usually it's just a very small codebase, or they don't actually let most of the repository load into the model. If you work on any kind of significant codebase, you will actually need quite a lot of context. Let me show you an example of a codebase that I'm using in a lot of these AI coding demos so you get a bit of an idea as to how much context you actually need in your model. So this is a demo that I'm also going to include in the link in the description, where, you know, I have a

bit of a simple application that allows you to create sample auctions and bid on them, in both Python and Java. So if I go ahead and fully refresh the page, you will see here that there are no auctions at the moment. And I can create a sample auction like so. And then I can, for example, you know, bid 90 bucks. The point of this is that I actually generated a little script that calculates the amount of tokens that are in this file repository. So here I have the token analysis JSON file, and you can see that this repository contains a total of 38,000 tokens. Now, that does include a couple of files that you generally don't need to give to your AI agent to get a useful response. So, for example, if we were to just look at the Python source code, we will see that there are only 9,000 tokens. But I want to remind you that this is a very simple sample application. So, if you wanted to load in, you know, just half of the repository to actually implement a new feature that works with the backend properly, then you will actually need more tokens than the default that LM Studio gives you. Because if we go back

to LM Studio and we cancel out and try to load this GPT model again, you will see that by default, we only have 4,000 tokens available. So for a regular AI coding use case, that is actually not enough. It's functional, but it won't really help you in a real AI coding scenario. And I really want you to understand how you can create a local environment for yourself that does work for real local AI coding. So the first thing you want to do is increase the context length. But here's the problem: context length is not free. If you check out here on the top right, you can see that the estimated VRAM usage is increasing quite a lot, and it's sort of bearable for me for this 20-billion-parameter model. Um, you know, because I have quite a lot of VRAM, I can actually go up to, you know, pretty much almost 100,000 tokens. This is kind of where we're getting into my own limitations with my GPU, but that's quite good, right? You can fit quite a lot of context. But this really just gets worse the larger the model actually is. Because if I cancel out and I go to this 32-billion-parameter model and then

I increase the context here, well, I mean, now when I'm at 75,000 tokens, I'm already at 45 GB of VRAM. That is not going to work for my machine. So this is kind of where the model selection comes into play. It's not just about the billions of parameters of the model. It's also about how much context you can actually support within your VRAM constraints. So, you have to actually test that first. I'm going to go ahead and load in the GPT 20-billion-parameter model. And let's set the context to 50,000. I think that's good. And then we'll be using around 20 GB of our VRAM. And I'm going to go ahead and load the model. Now, again, in some advanced setups, it is possible that you don't have to load the entire model into your GPU, but those are really advanced use cases. For most people trying to just run local AI coding, you should assume that you need to run the full model on one GPU and load it in entirely. And you can see here indeed that my dedicated GPU memory has just shot up to over 22 GB, give or take some that is being used by the system for other processes. That is about right. Now what I can do is let's

go ahead and just ask a very simple question, right? Generate a Python class that represents a gym, including different kinds of machinery. Now what happens is, the moment that I press go, it will actually send the prompt to the GPU, and you can see that because there's not a lot of context, it's actually incredibly fast, right? You can see my usage is almost 100%, not quite, but you will see that definitely happen as the prompt context grows larger. And you can see that very quickly it just generated a class. Now, of course, it's a trivial example because this is just brand new code. It's not actually coding in a repository at the moment, but you can see that the model itself is blazing fast at 170 tokens per second. And I can even optimize this further. Now, on the screen, you will actually see this exact prompt being run by my MacBook with an M4 Pro chip with 48 GB of RAM. So, the entire model fits on there as well. It is slower, but it's actually still quite

reasonable. You know, the thing is though, this is not a real AI coding scenario because usually when you have AI coding scenarios, you're working in an existing repository. You don't just start off from scratch with a conversation that's empty all the time, right? You actually load in different files inside of the AI code agent. The code agent might explore different files and just fill up its own context. So, this is actually not a very realistic example of how long it takes to generate a response. The way that I'm going to actually show a more realistic example is by providing the entire code context of that auction website in a single prompt. The way I'm going to do that is by again referring to that Python dump file in here. And in here, you see the entire content of my repository that I just kind of dumped into one single file. Specifically, this is all of the Python code. So if we keep scrolling, you can see it's over, you know, 1,400 lines of code. And in terms of the amount of tokens, just to remind you, you know, the amount of tokens is around 8,000. And one token represents

approximately three-quarters of a word. So that's kind of, you know, the measurement, the cost measurement for these AI models. What I'm going to do now is actually start a new chat, and I'm going to zoom in the screen a little bit here, because you will be able to see that there are over 11,000 input tokens used, since we're adding some metadata like the file names. And the moment that I actually send over this prompt, you will see that the response is not immediately generated. Here is the problem. The language model actually has to process the prompt now. And you can see that my GPU's 3D usage is shooting up to 99%, and only then is it able to generate a response. So, if we keep scrolling now, we can see that, of course, you know, it's still pretty quick. It's able to refer to a couple of the code snippets. I can actually go ahead and ask questions about the repository, but you can see here that it already took a lot longer. And I'm using one of the top-line GPUs. This gets a lot worse the bigger your repository is. So, for example, let's actually go and start a new chat. And now paste the entire repository three times. So, I just pasted three times.
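That rule of thumb (one token is roughly three-quarters of a word, or about four characters of text) is enough to write your own version of the token-counting script mentioned earlier. Here's a hedged sketch, not the actual script from the demo repository: the `CHARS_PER_TOKEN` heuristic and the extension filter are assumptions you'd tune, and a real tokenizer library would give exact counts.

```python
import os

# Rough heuristic: ~4 characters per token for English text and code.
# A real tokenizer gives exact counts, but this is close enough to
# judge whether a repo fits in a local model's context window.
CHARS_PER_TOKEN = 4

def estimate_tokens(text: str) -> int:
    """Approximate the number of LLM tokens in a string."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def estimate_repo_tokens(repo_path: str, extensions=(".py", ".java")) -> dict:
    """Walk a repository and estimate tokens per source file."""
    counts = {}
    for root, _dirs, files in os.walk(repo_path):
        for name in files:
            if name.endswith(extensions):
                path = os.path.join(root, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    counts[path] = estimate_tokens(f.read())
    return counts

# Example: 36,000 characters of Python source comes out to roughly
# 9,000 tokens -- the same order of magnitude as the demo repo.
```

Summing the per-file estimates tells you, before you ever load a model, whether your default 4,000-token context stands any chance against the repository you actually work in.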

You can see that we're using 34,000 tokens. This is not an unrealistic amount of tokens if you're working in a real code repository. And this is where most YouTube tutorials stop, right? They will just show you this optimistic use case and say, "Wow, look at how fast my AI model is generating tokens locally." But when you're working in a real code repository, you will find that the performance drops very quickly as context gets added to the language model. Because now, if I run this prompt again, you can see that it takes quite long to process the prompt. Still somewhat reasonable because I've got a really good GPU, but in your case, it might take much longer. And this is a great example of how a language model might be able to fit in your memory, but that doesn't mean that it's going to be performant, because you will actually see that if I use a very small language model, it will process that exact prompt much faster. I can show that to you, but I think this is an example of something you should definitely go and try out for yourself. The problem is, though, that if you have a model with fewer parameters, it's usually worse at the specific task that you give it. In this

case, coding. And coding is already a pretty complex task, right? And the unfortunate truth is that if you want a really capable AI code agent, it needs to be able to support tool calling. For AI models to actually call tools properly, like being able to run things in your terminal, explore files autonomously, even call external services, well, you need a pretty good model. Most of those models start from 20 billion parameters, but really, in a lot of complex use cases, I would even advise this kind of 32-billion-parameter model. So, we're going to do a couple more tests with this 32 billion model, and then we'll move on to actually implementing these local AI models in code agents inside of Visual Studio Code. So hang in there, because we're learning a lot. But now let's go ahead and load in Qwen 2.5 to see how that performs. So I'm going to go ahead and eject the GPT model. And you can see that immediately all of my dedicated GPU memory just drops because, of course, you know, we're getting rid of the model. It's being cleaned up. And now we can go ahead and load in the Qwen model. What I'm actually going to do is I'm going to

show you what happens when you try to load in a model and provide more context length than your GPU can actually feasibly support. So I'm going to go ahead and go crazy here. I'm going to set it to, let's say... I think 35 GB is a good test, because that is more than my dedicated GPU memory. But you will see that LM Studio will still try to load the model, and that actually gives you quite a lot of issues. So let's go ahead and load in the model. And you can see that at the very beginning nothing special is happening, right? We're just filling up that dedicated GPU memory like so. Um, but eventually we will actually hit the limit of our dedicated GPU memory, and that's exactly what happens now. So we're hitting that ceiling, and oh, you can see here that my shared GPU memory is actually being used. And the problem with this is that the shared GPU memory is actually the regular RAM that is part of my system. If you load part of the model into your RAM, it's going to perform quite terribly. And in fact,

you can see right now that my own video feed is starting to lag. And I'm not even going to edit that out, because I'm running all of this on my machine, and I want you to understand the performance implications when you do something like this. So, I'm going to keep my laggy stream like so. It's just part of what it is now, because I'm just completely, you know, overutilizing my system. But this is a really big problem. You can also see that it has a lot of trouble actually finishing loading the model, because it has to load so much of the model into my regular system memory. In fact, I think I was actually a bit optimistic about this finishing in due time. So let's skip ahead to when the model is finished loading, and I'll show you what the problems are of overloading your memory like this. All right, so now the Qwen 2.5 model is loaded in and is using too much of this shared memory. And the problem with this really is that you will see that it's super slow. So if I just say hello, which is something that the other model would just respond to immediately, you can see here that it's taking a long time. It's fully using my GPU, but other than that, because so many of the model

parameters are in the shared memory, it's just taking an incredible amount of time to load a response. And you can probably see that my video feed is going crazy again. So the important lesson here is that just because you can load a model onto your GPU doesn't mean that it's a good fit. You need to make sure that you're not using too much memory, and leave some headroom to add new context length without sacrificing your regular computer memory, because clearly it's not meant for this kind of LLM invocation. So what we're going to do is just load in that Qwen model again, and this time I'm going to set the context length to something more reasonable for my GPU, like 6,000. Let's do it like so. And that's already not that much. So in my case, I would actually prefer to use the OpenAI 20-billion-parameter model, because for most of my code repositories, 6,000 tokens is really not going to be enough. But here too, it just depends on what you're trying to achieve, right? So let's go ahead and load it in. And of course, that was a lot faster. We're pretty much only using the dedicated GPU memory now. So now when I go ahead and I start a new chat, I can just say hello. And you will see that finally it's

actually generating. Now, just to show you that this is going to be a lot slower than the OpenAI model, let's go ahead and use the same prompt that we did before with the gym. So, I'm going to scroll up here and copy this prompt, and in a new chat, I'm going to paste it. And now you will see that it starts to generate pretty much immediately, but it is a lot slower. That being said, this is a much bigger model than the OpenAI one. So, the code that you get from this model probably has better quality, right? So if we go ahead and just let it generate this entire class, we can also compare the amount of tokens it was able to generate per second versus the OpenAI model that we tried before. So we're just going to wait until it's done, and then we can see how many tokens it actually generated and how fast it was at doing so. So I think it's almost done. Here we go. So we have 1,276 tokens, and it generated at 42 tokens a second. Now, if we go ahead and compare that to the other prompt with the smaller OpenAI model, you can see that it was much faster at 175 tokens a second. Now, that doesn't mean that the Qwen

model is always four times slower. It really depends on your prompt and the size of your context window. But the lesson here is that as the amount of parameters increases, the model does just get slower. So you have to choose a model that fits in your GPU, that has a good context window for actual real AI coding, and that has a speed that is acceptable for real-world use cases. Because the truth is, if I have to wait 30 minutes for simple code to be returned, I might as well program it myself or use a cloud model, right? So this is the very important fundamental knowledge that you need to step into the world of local AI coding. But I'm sure that at this point you're just tired of seeing this specific LM Studio screen, and you want to see this play out in the real world in Visual Studio Code. So, let's see now how we can connect these local AI models to different open-source AI agents that are available inside of Visual Studio Code, including a way to actually route Claude Code through my own local AI model. So, to make sure that your LM Studio models can actually be reached by all of these

open-source code agents, what you want to do is go to the developer tab and make sure that your local server is actually running. This way you can approach these kinds of LLMs programmatically, and these AI agent tools already know how to do that. I'm going to do that with the open-source tool called Continue, which is a pretty nice open-source AI agent tool that I'm able to connect with local models. So what I can do here is go to select model and click on add chat model, and then instead of going for a cloud provider like OpenAI, I just scroll down here and select LM Studio. Then I can just autodetect the model that's in my API, which is probably a good idea because I don't see the actual LLM that I'm using. There are some that are quite close, but not exactly the one that I have. So I'm going to go ahead and let it auto-detect, and it's then going to just add it to the config. And you can see here that it has now auto-detected that Qwen 2.5 model. And that's the one that I'm going to use because it's already loaded in. Right? So now what I'm going to do is just go ahead and ask

it to, you know, explore my codebase, because I'm curious if it even works in the first place. Right? So now, just to make sure that you know that it is truly running locally: when I run this command, it is actually listing files using these agentic operations, and you can see that all of my GPU usage is happening completely locally. And now that it's done with the prompt, the GPU usage is dropping again. So, just to show how easy it is to set LM Studio up with many different tools, I just installed the Kilo Code AI agent, which you can access in Visual Studio Code as well. And I can just select use your own API key. Then I can just search for LM Studio. And I can actually add the base URL. I can find that by going to the developer tab. And you can see this is actually that base URL. And you can already see that it found the two models that I have available. But in this case, I only loaded in the Qwen model. And I will say, with these real AI code tools, these AI agentic tools, a model that's only 20 billion parameters usually has a lot of trouble with actually calling tools properly and

understanding the structured output that Kilo requires to actually make all the functionalities work properly. So that's why I'm going for the bigger model here. So to show that Kilo Code can actually work with my local models, what I'm going to do is make it add a couple of examples to these sample auctions. Because right now, if I create sample auctions, there's only a limited set of them, like that Raspberry Pi cluster kit and this vintage programming book collection. So let's go ahead and ask it to find the place where sample auctions are created, and then let's say focus on creating at least three more examples, and then I'm just going to go ahead and execute that. So now it's actually starting to generate some output, because you will find that it actually already runs out of token space the moment that it just reads this one single file. And the thing is, it's actually going to get into a loop now, because when Kilo Code notices that the LLM is almost full of context, it will try to condense the context. But that doesn't really make a lot of sense when there's almost nothing

in there yet, right? Like, we barely had any messages with the AI assistant. So you will see that now it will actually get into a loop where it keeps wanting to read a file. You know, it keeps trying, and it can't condense the context. And it's all just because we don't have that much context available in the model. It might sound like a lot, 19,000 tokens, but all of these heavy AI coding agents, they eat up your context like it's their favorite lunch. So, it really is a problem. It's just already choking on that context limitation. So, what I'm going to do is just cancel this and introduce you to some of the optimizations that you can enable in order to fit more context into your LLM without requiring as much memory usage. Now, all these options are experimental, and depending on the model that you're using, they might also just reduce performance by quite a lot. But what I'm going to do is go ahead and actually enable flash attention, and I'm going to add the KV cache quantization and set it to F16.
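To see why KV cache quantization helps at all, it's worth doing the back-of-envelope math: the KV cache stores one key vector and one value vector per layer, per KV head, for every token in the context, so its size grows linearly with both the context length and the bytes per stored value. The sketch below uses illustrative architecture numbers (64 layers, 8 grouped-query KV heads, head dimension 128 — plausible for a 32B-class model, but assumed, not taken from Qwen 2.5's published config):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int) -> int:
    """Size of the KV cache: a key vector and a value vector (the
    factor of 2) per layer, per KV head, per token in the context."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

GiB = 1024 ** 3

# Illustrative dimensions for a 32B-class model with grouped-query
# attention (assumed values, not an exact architecture).
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

for ctx in (4_000, 30_000, 100_000):
    f16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx, 2) / GiB  # 2 bytes/value
    q8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx, 1) / GiB   # 1 byte/value
    print(f"{ctx:>7} tokens: {f16:5.2f} GiB at F16, {q8:5.2f} GiB at Q8")
```

With numbers like these, a 30,000-token context costs several extra gigabytes of cache on top of the ~21 GB of weights, which is exactly why raising the context length can push a 32B model past a 32 GB card, and why halving the bytes per cached value shaves off a noticeable slice of VRAM.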

Then I'm going to go ahead and reload Qwen, but I am going to do it by, you know, also adding more to the context length, because that's what we need, right? So we had about 20K, but I really think that I need about, you know, 30,000 to really get something out of this session. So I'm going to load the model into memory. And now let's have a look at how much of the dedicated GPU memory this is going to use. So let's go ahead and tune it a little bit. I've now set it to 27,000 tokens. But to make sure that my GPU doesn't choke, because it is actually reaching the dedicated limit, I'm going to go ahead and turn on flash attention as well as the KV cache quantization, set to F16. If I reload the model with these parameters, let's have a look at how much of the dedicated GPU memory will actually be used. And you can now see that we've shaved off a little bit of dedicated GPU memory. It's not much, but you kind of want to squeeze out as much context as you can with your local GPU, right? Because as we see here, every, you know, hundred tokens is extra context that you really need for these agentic AI coding

sessions. So, because I'm so constrained with my context window on this specific machine, I'm going to give it a lot more hints. And I could just pretend like, you know, these AI agents work very well locally. But as you can see, you just need a big context window to make them operate properly. And in this case, I just have to provide way more hints about where the actual code is that I want to change. So in this case, I'm simply going to let it know that the sample auctions aren't here, and that it should find them and then add some extras. Right? So: the sample auctions are in here, find them and add three new ones. If you have a bigger context window because your machine has more VRAM, or in the case of, you know, Mac devices, perhaps just more unified memory, then that's perfect, right? And in a couple of years, you know, it's probably going to be a lot easier to run these AI agents fully locally, but we just have a lot of limitations right now when it comes to, yeah, these coding agents, and you have to provide more hints if you're running out of context. So, here you can see we're actually

making some good progress. It can actually find the sample auctions array. So I'm going to go ahead and find that in my VS Code here. And indeed, this is where the sample auctions are defined. So now Kilo Code is going to actually try and edit the file. Let's see how it does. And there we finally go: an actual AI agent edit. So if we go to our git tree, we have a small change here in next.html. And indeed, we have added these three sample auctions. So now, let's go ahead and see if we can find any of them in the actual running code if I just refresh it. So, I'm just going to go ahead and reset the database until we get one of these new sample auctions. So, we have the vintage programming book collection. That's fine. It's all random. So, we might have to click through this a little bit. It's all fine. Let's continue. Mechanical keyboard, whatever. Let's go ahead and continue. And then: smart home automation system. There we go. That's it. The smart home automation system that maps

to this sample auction. So that fully works, and you can see that this agentic code agent does work with my local AI model. But the truth is that your experience varies depending on how complex your code is and how much you expect these AI code agents to do for you. And depending on your machine type, you might just have to, you know, start coding only in an environment like LM Studio, or in a chat-only context where you are the one providing the code files that have to be changed. And this is where a lot of the more advanced terminal tools like Claude Code have come in and really made the actual code editing experience really good with these AI models. The problem is, though, that tools like Claude Code use the cloud API, right? You can't use them with local models. Or can you? Well, this is where the community comes in, and actually there is this great project called Claude Code Router which allows you to use something like Claude Code with any local AI model. So let's go ahead and configure that, because this actually worked very well in my real-world testing, and it's not just for pet projects. So I'm going to go ahead and

First, I'm going to close out Continue here to give the screen a little more space, and start a new terminal, because I want to launch the Claude Code Router user interface. The instructions for installing it will of course be in the description, but at this point I'm assuming that you've got it installed, because then you can type ccr ui. When you do that, it will start the Claude Code Router service along with a nice little user interface where you can configure a specific provider. I just removed what was already in my configuration, so we're both starting from scratch, and we're going to add a new provider. Now, I'm not going to select a template, because I want to teach you what you need to do to add LM Studio as a provider for these kinds of tools. If you're using something like Kilo Code, adding your LM Studio API is going to be pretty similar; all of these tools have similar interfaces or ways to configure models. So even if you don't want to use Claude Code Router, if you follow this part of the video, you will understand how to add your LM Studio server to

basically any open-source tool like this one. What you want to do is open LM Studio and double-check that your server is actually running. Your server will have a URL, a local URL that you can copy like so. In the API full URL field, I can paste this. However, that's not actually the full URL for calling the LLM: the server exposes specific endpoints, for example to list all the models or to post a completion. The endpoint we care about is the chat completions endpoint. This is the path you want to append after the beginning part of your API URL. I can name this provider, for example, "LM Studio". As for the API key: because it's a local AI model, there isn't really an API key, but I have to fill one in here anyway, so I'm just going to type some random characters. And for the model, I'm just going to copy the model name by clicking the copy button here.
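To make the endpoint idea concrete, here is a minimal sketch of what any of these tools effectively does when it calls LM Studio's OpenAI-compatible local server. The port (1234) is LM Studio's default, and the model identifier is just an example; match both to what your own LM Studio instance shows.

```python
import json
import urllib.request

# LM Studio's local server speaks the OpenAI-compatible API,
# so the base URL alone is not enough: tools must call a
# specific endpoint under /v1, here chat completions.
BASE_URL = "http://localhost:1234"           # LM Studio default port
CHAT_ENDPOINT = f"{BASE_URL}/v1/chat/completions"

def chat(prompt: str, model: str = "qwen2.5-coder-32b-instruct") -> str:
    """POST a single-turn chat completion to the local server.

    The model id must match the identifier LM Studio shows for the
    loaded model (use its copy button). The API key can be any
    string, since the local server does not check it.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    req = urllib.request.Request(
        CHAT_ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-a-real-key",  # dummy key
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# The "API full URL" a tool needs is the base URL plus the path:
print(CHAT_ENDPOINT)
```

When LM Studio's server is running with a model loaded, calling `chat("Explain this function")` returns the model's reply as a string.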

I paste it in like so and click Add Model, and now you can see that indeed Qwen 2.5 has been added. I'm not going to add any transformers here; that's not necessary. I can just click Save. Now we can start assigning this Qwen model in our router, because Claude Code Router supports different models for different use cases: if it needs to do deep thinking, for example, you might want to use your biggest model for that. In this case, I'm just going to use Qwen across the board, use it for everything, and then see if Claude Code actually works with that. So we're going to save and restart the service. Now I can go back to Visual Studio Code, because there we have to start Claude Code through the CCR service. I shouldn't just type claude, because that starts regular Claude Code; you can see here, for example, that we're missing the API key, so this is not going to work. I'm going to exit out of here. The actual command that I need to run is ccr code.
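As an aside, the provider and router settings you save in the UI are persisted in Claude Code Router's JSON config file (by default `~/.claude-code-router/config.json`). A rough sketch of what the setup above corresponds to; the field names follow the project's README, while the URL, dummy key, and model identifier are just the example values from this walkthrough:

```json
{
  "Providers": [
    {
      "name": "lmstudio",
      "api_base_url": "http://localhost:1234/v1/chat/completions",
      "api_key": "not-a-real-key",
      "models": ["qwen2.5-coder-32b-instruct"]
    }
  ],
  "Router": {
    "default": "lmstudio,qwen2.5-coder-32b-instruct"
  }
}
```

Check the claude-code-router README for the exact schema before editing this by hand; the Router section also supports separate roles (such as background, think, and longContext) that you can point at different models.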

Now Claude Code is running through Claude Code Router, and you can see that the warning is gone. After playing around a little with this tool, though, you see that it gets stuck in a loop: it keeps reading the index.html file again and again and again. Why is it stuck? It's simply running out of context space. So what I'm going to do is stop using the Qwen model and downscale to the open-weight OpenAI model with 20 billion parameters, but with a really large context window, because again, the biggest problem with your local AI coding performance is likely going to be that context window limitation. So I'm going to set up the 20-billion-parameter model with a really big context window, and I'll see you in a second. Now I've loaded the 20-billion-parameter OpenAI model, and I'm going to replace all of the Qwen references with it: add the model like so and use it everywhere, because it's just a lot more appropriate when you're working with larger files, like I have in my case. This is actually why I have such a big index.html file: I want to show you what happens when you start feeding a lot of context to a local model.
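A back-of-the-envelope sketch of why a big file blows up the loop: you can roughly estimate whether a file even fits in a model's context window. The 4-characters-per-token ratio is a common heuristic for English text and code, not the model's actual tokenizer, so treat the numbers as ballpark figures.

```python
# Rough check of whether a file will fit in a local model's
# context window, using the ~4 chars/token heuristic.
def rough_token_count(text: str) -> int:
    return len(text) // 4

def fits_in_context(text: str, context_window: int,
                    reserved_for_output: int = 2048) -> bool:
    """True if the file, plus some headroom for the model's reply,
    should fit in the given context window (sizes in tokens)."""
    return rough_token_count(text) + reserved_for_output <= context_window

# Example: a ~200 KB index.html against two window sizes.
html = "x" * 200_000                      # stand-in for a large file
print(rough_token_count(html))            # 50000
print(fits_in_context(html, 32_768))      # False: the agent will loop
print(fits_in_context(html, 131_072))     # True with a 128k window
```

If the check fails, either raise the context window in LM Studio (at the cost of VRAM and speed) or feed the model smaller pieces of the file.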

Ingesting a lot of context into your AI coding tool is usually where it chokes when you're running local models. So I'm going to restart the CCR tool, and then we're going to ask a serious question: I'll ask it again to investigate this index.html file and propose three ways to improve the styling. I can just drag and drop the file in here, since I can use the Claude Code interface. Then I'll use plan mode and say: investigate this file and propose three ways in which you can easily improve the styling. You can see it's reading all of those lines and adding all of that to the context window, and now it starts generating the output. This is going to be the plan it proposes to me, and it will be done in just a second, I'm sure. And as it generates, there we go: we got a plan. Let's open up the window a little more.

You can see it gives me a couple of great proposals, like centralizing colors with CSS variables and making the site more responsive, and I can just go ahead and work on this now with Claude Code connected to my local AI model. So I hope these examples have given you a good idea of what you can actually achieve with local AI coding today. This master class has shown you how to set up local AI models for coding and when to use them in different situations. The truth is, you have to be careful with your expectations: you're not going to get the full agentic coding experience that you would get from a state-of-the-art cloud model, simply because you're probably going to run into that context window limitation fairly quickly. And even if you don't, because you have a lot of VRAM, you might find that your local model becomes very slow once the context fills up. So the best strategy here is to use local AI coding for simpler scripts and small software projects; when you're working in a complex software environment, you might still need to hand some of the work to a state-of-the-art cloud model instead. If you want to learn more real AI engineering like this, you should check out my AI engineering community.

The link is in the description below, and I hope to see you there. Thanks for watching.