Running an LLM at home
How I learned to stop worrying and love LLMs
About a month ago, I found myself with a weekend to myself and I thought "Hmmm, it's been a while since I tried to run an LLM at home." It's been so long that I've actually upgraded from a RT 6600 XT to an RTX 4070 Ti Super, RT 7900 XT and most recently an Arc B570. Let's call it my personal testbench.
Since the RT 7900 XT has 20GB of VRAM, I decided to use that as a testbed for trying out new LLMs. It's run on a Intel Core 9 285K with 64GB of DDR5 RAM. I tossed Ubuntu 26.04 on it and installed lm-studio and went with Qwen3.6 26B. Why Qwen3.6 26B? Well, it scored well and is small enough that I can run it on 20GB of VRAM with enough KV-cache context to be useful. And guess what? Out of the box, it got around 10 tokens/second. Not bad, so I kept the 8bit KV-cache.
There was a problem though: I had to be on lm-studio to ask questions, which meant I had to be logged into my workstation. I wanted a setup that would be similar to Anthropic Claude's Visual Studio Code plug-in. I picked CrewAI as an orchestrator so I could run over my testbench network. And I started writing a VSCode extension, as well as a web interface. The orchestrator ran on a separate laptop, because why make things simple when I can make them have multiple points of failure? 😂
Having an orchestrator on a separate machine allowed me to do separate the logic of the orchestrator from the running of the LLM. This served another purpose: I also wanted to be able to have a single interface to also generate text-to-images. But, I wanted to do it simultaneously as running Qwen3.6 27B. So, I decided to setup a different box I had with an old AMD Ryzen 3600 with an Intel Arc B570. This one I setup with ComfyUI and and StabilityAI SDXL Turbo 1.0.
Initially the orchestrator was hardcoded for the user to choose what it wanted, whether the programming language (Python, C++, etc) or generating an image. I got far along enough that I could choose the text-2-image and it would generate and return an image. Success!
So, I switched and focused on Qwen3.6 27b and it's ability to program. I wanted it to work like Claude in VSCode: ask for "GEMM C++ program" and it would generate files and save them wherever the VSCode extension was running. But, I wanted to eliminate the middle man (ME) that would compile and run it. So, I built the CrewAI to also pass it to Mac runner. After dozens or so iterations, it worked! I could ask for a "GEMM C++ app" and it would finish by generating files, compiling and running them on my Mac, and generating and saving the files where the VSCode extension was running.
The orchestrator ran on an old laptop which happened to have an RTX 2060 on it. So, I decided to have the orchestrator be "intelligent." In other words, I wanted the orchestrator to be a project manager agent that would orchestrate for me. Unfortunately, the RTX 2060 mobile with 6GB of VRAM could only run Qwen 3L 4B and that did not have the horsepower to be a PM. So, I installed the orchestrator and lm-studio on my gaming rig, which has an RTX 4070 Ti Super with 15GB of VRAM. But, for the PM I used Qwen3.6 35B because it uses MoE and only keeps 3B resident. This allowed for a larger KV-cache so a longer context.
Unfortunately, after dozens of iterations on using Qwen3.6 35B as an agent, I went back to my dumb orchestrator. The PM would break down what needed to happen into (depending on the bit size of the KV-Cache) 6 to 12 steps. But, the "software engineer" also received the request ("Write a GEMM C++ app") for context purposes and the step it needed to do. And the software engineer would then go through the process of also breaking down into manageable steps and then execute on all of the steps. So, the PM was redundant.
What did I learn? The LLM ecosystem has come a long way. Standing up LLMs and text-to-images was much easier this time around which is great. But, there's a lot of customization that I wanted and I started running that way for fun. I'm sure there are out-of-the-box solutions for the things I wanted, but I wouldn't have learned from that. And that's the point. I'd also say that there are knobs to play with for the LLM as well. For instance, KV-cache bit size directly affects how frequently it gets stuck in loops when "thinking." Finally, while I could get 20 tokens/second out of both Qwen3.6 27B and Qwen3.6 35B, it was still too slow for my tastes and would work in a pinch if I needed it to. But, even though it was slow for my tastes, it still produced compilable C++ code that ran within the parameters I've asked. That is incredible on a GPU that had a launch price of $899. Absolutely incredible.
TL;DR: I had a weekend to play with LLMs and text-to-images. I setup a home network to separately run an orchestrator, Qwen3.6 27B, StabilityAI SDXL 1.0, and Qwen3 3L 4B on distinct GPUs. The orchestrator ran CrewUI which provides to the local network a web page or VSCode extension that would allow an easier interface to generating images or code. I experimented with using a more powerful GPU and larger LLM to be a project manager (Qwen3.6 35B) to the software engineer (Qwen3.6 27B) but the software engineer didn't really need it.