# Local coding with AI: It works… Just not on my laptop
## Digression

It's been a while since AI hit the public in a way that some people use it to plan vacations or do basic addition. That's pretty much the same time we (developers) discovered a new tool that can help (to ruin the market) us work a little faster. I have a bit of a love/hate relationship with this tool, mainly because I wonder where humanity is going with our "pollution" problem. But I also love it because, you know, it has lowered many barriers for smart people to build stuff. And I love the idea that people, whatever their country or background, are able to build something for fun or otherwise. For me that is also the main purpose of the internet: sharing knowledge, being able to learn things that aren't accessible where we live, or because of money. Another subject is privacy.

So the plan was pretty basic: we use llama.cpp to get a local model working as an API (basically), and pi (or opencode) for the chat-like experience. I'm using Arch (btw), but all the steps here are pretty much the same for any distro or OS.

The first step is to install the tools we need:

```
yay -S llama.cpp-vulkan
npm install -g @mariozechner/pi-coding-agent
```

The first command installs the version of llama.cpp that is optimized for my machine; on the AUR you can find a huge variety of optimized builds. The second one installs the pi harness.

Like the rest of this little experiment, running the model with llama.cpp is actually very simple; the only "hard" step is choosing the model. Sorry in advance, no big reveal here, no sock-blowing website: I chose this model because other people have already used it and it works somewhat well. Some websites propose model rankings, and I think that's great, but we also know for a fact that rating LLMs is a pretty hard task, mainly because of the nature of how LLMs work.
Also, this is a little fun experiment, so if we really need to, we can always switch to another model later :)

Let's get back to the action:

```
llama-server -hf AaryanK/Qwen3.6-27B-GGUF:Q4_K_M --port 8080 --host 127.0.0.1 -c 8192 -t 8 -ngl 40
```

I guess you already know what most of this command does, but let me explain the weird flags:

- `-c` controls the context size
- `-t` controls the number of CPU threads
- `-ngl` controls how many layers are offloaded to the GPU

These are the most important values here, because they are the ones we need to tweak and test to get the most performance out of our model/PC couple.

If everything is working correctly, you should now see something like this in your console:

```
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv update_slots: all slots are idle
```

Right after that, we have to actually get the model "linked" to our chat. With pi that's pretty simple:

```
mkdir -p ~/.pi/agent
nvim ~/.pi/agent/models.json
```

In `~/.pi/agent/models.json`:

```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://127.0.0.1:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Qwen3.6-27B",
          "contextWindow": 8192,
          "maxTokens": 4096
        }
      ]
    }
  }
}
```

The two most important values here are:

- `contextWindow`: how much the model can remember
- `maxTokens`: the response size limit

We can now check that the model is actually linked by launching pi and selecting it via the `/model` command.

Okay, my machine is a Tuxedo notebook (InfinityBook Pro something) without a discrete graphics card, and certainly not an Apple Silicon chip. So here's the reality check for me: asking the model "what's the day?" takes a really long time to get a response, and a pretty damn good chunk of this PC's compute capacity.

For me, I guess this is the end. Without a machine good at this kind of compute, I'm pretty much unable to use a local model, or I have to use a much smaller one, so the results are not going to be as good compared to Claude Code or ChatGPT.
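If you want to sanity-check the link outside of pi, you can also hit llama-server's OpenAI-compatible API directly. Here's a minimal Python sketch, assuming the server from above is running on port 8080 (`build_payload` and `ask` are just helper names I made up):

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # llama-server's OpenAI-compatible API

def build_payload(prompt, model="Qwen3.6-27B", max_tokens=4096):
    """Build the JSON body for a /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # matches maxTokens in models.json
    }

def ask(prompt):
    """Send the prompt to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Calling `ask("what's the day?")` should give you the same (slow, on my machine) answer as the pi chat, which is a nice way to confirm the bottleneck is the hardware and not the harness.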
And I don't want to set up a VPS for that (maybe another time?).

Another task where a local model can help: code completion. I'm going to use Mellum from JetBrains to test that. The setup is as easy as before:

```
llama-server -hf JetBrains/Mellum-4b-base-gguf --port 8989 --host 127.0.0.1 -c 8192 -t 8 -ngl 40
```

And for pi we just add the new model, with a second provider entry pointing at the Mellum port:

```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://127.0.0.1:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Qwen3.6-27B",
          "contextWindow": 8192,
          "maxTokens": 4096
        }
      ]
    },
    "llama-cpp-mellum": {
      "baseUrl": "http://127.0.0.1:8989/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Mellum-4B",
          "contextWindow": 8192,
          "maxTokens": 4096
        }
      ]
    }
  }
}
```

If needed, you can always run both servers at the same time on their different ports, but if I did that I guess my PC would melt. We can select the new model via the same `/model` command.

The response to "hey" was much quicker than with the other model, for the same amount of fan noise. So what's the deal? This is a code-completion model, so it can't really answer questions, I guess. But! Let's try code completion then: even on my machine it's quick enough to use in Neovim, for example with a code-completion plugin.

Okay, the model is fast enough (it seems), so let's try it in a real project.
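Code-completion models like Mellum work in fill-in-the-middle (FIM) mode: they get the text before and after your cursor, and predict what goes in between. llama-server exposes this through its `/infill` endpoint, so you can poke at it directly before wiring up an editor. A minimal sketch, assuming the Mellum server from above on port 8989 (`fim_payload` and `complete_at_cursor` are made-up helper names):

```python
import json
import urllib.request

MELLUM_URL = "http://127.0.0.1:8989"  # the Mellum llama-server instance

def fim_payload(prefix, suffix, n_predict=64):
    """Build the body for /infill: the text before and after the cursor."""
    return {
        "input_prefix": prefix,
        "input_suffix": suffix,
        "n_predict": n_predict,  # cap the completion length
    }

def complete_at_cursor(prefix, suffix=""):
    """Ask the model to fill in the code between prefix and suffix."""
    req = urllib.request.Request(
        f"{MELLUM_URL}/infill",
        data=json.dumps(fim_payload(prefix, suffix)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["content"]
```

For example, `complete_at_cursor("def is_even(n):\n    return ", "\n")` asks the model to fill in just the expression. This is exactly what a completion plugin does on every keystroke, which is why the latency of the small model matters so much.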
For that I'm going to add this to my Neovim config (with lazy.nvim):

```lua
{
  {
    "milanglacier/minuet-ai.nvim",
    config = function()
      require("minuet").setup({
        provider = "openai_fim_compatible",
        n_completions = 1, -- use 1 for local models to save resources
        context_window = 4096, -- adjust based on your machine's capability
        throttle = 500, -- minimum time between requests in ms
        debounce = 300, -- wait time after typing stops before requesting
        provider_options = {
          openai_fim_compatible = {
            api_key = "TERM", -- llama.cpp needs no real key; any non-empty env var works
            name = "llama.cpp",
            end_point = "http://localhost:8989/v1/completions",
            model = "JetBrains/Mellum-4b-base-gguf",
            optional = {
              max_tokens = 256, -- maximum tokens to generate
              stop = { "\n\n" }, -- stop at double newlines
              top_p = 0.9, -- nucleus sampling parameter
            },
          },
        },
        -- virtual text display settings
        virtualtext = {
          auto_trigger_ft = { "*" }, -- enable for all filetypes
          keymap = {
            -- set these to whatever mappings you prefer
            accept = "<A-y>",
            accept_line = "<A-l>",
            next = "<A-n>",
            prev = "<A-p>",
            dismiss = "<A-e>",
          },
        },
      })
    end,
  },
  {
    "Saghen/blink.cmp",
  },
}
```

After reloading that in a project of mine (postier), I can try it out. Hey, that's not bad! After some experimenting, the results are not always great, but I guess the model is maybe not the most capable, and the settings can also be tweaked for better results. But hey, you've got a mostly free AI auto-completion in your Neovim, that's cool, and it can spare some token usage I guess ^^ And it works when you're offline too, so it can be a good option for certain use cases.

So, local coding with AI? Yeah I guess, just maybe not on this machine. What I've built here does work. It proves the point. You can run your own local models, wire them into tools like pi, plug them into Neovim, and get something that feels very close to the "big AI experience"… without sending your data somewhere else.

But hardware matters. A lot. (Ouch.) Right now I'm running this on a CPU-only laptop, and honestly, it shows.
The experience is slow, sometimes frustrating, and clearly not comparable to cloud models. So, not a great experience for everyday work. But take the exact same setup and drop it onto a more modern machine, and things change drastically.

Take something like a MacBook Pro with Apple Silicon. Those chips come with powerful integrated GPUs and unified memory, which are insanely good for this kind of workload. You can offload a large part of the model to the GPU, increase the context size, and suddenly responses go from "go grab a coffee" to "this is actually usable."

Same story on the PC side. A desktop or laptop with a decent GPU will run circles around a CPU-only setup. With proper GPU offloading (`-ngl`), quantized models, and a bit of tuning, you can get real-time or near real-time responses, even with larger models.

What I like most about this whole thing isn't just the performance, it's the control:

- code stays local
- prompts stay private
- you can tweak, break, and rebuild everything
- you're not tied to an API or a pricing model

That feels very close to what I want: owning your tools and understanding how they work.

So yeah, my poor CPU-only laptop is struggling. And honestly… it kind of makes me want to upgrade my hardware (if I win the lottery).

So I leave you there, have fun!
