A 27B model on an AMD mini-PC fixed a bug in our operator. Then it overreached.
Last time Foreman built a feature for itself. This is the sequel, and it is a better story: this time it fixed a bug I had just shipped, the model doing the fixing ran on a consumer AMD box on my desk, and the most useful moment in the whole run is where the model got it wrong.
TL;DR
- Wiring up Claude Code against one of my own local models surfaced a real bug in the operator: a hardcoded 60-second timeout that silently capped every request regardless of what you configured.
- I handed it to Foreman with a 27B dense coder (Qwopus, a Qwen3.6 distill) on a consumer AMD Strix Halo machine over Vulkan. No datacenter, no NVIDIA, no cloud GPU.
- The model produced the correct fix and the gate confirmed it compiled, vetted, linted, and passed. It also quietly wrote two tests for unrelated code, and the gate passed those too. I caught that on review and trimmed them. That gap is the whole point.
- Verified end-to-end: a request that used to die at exactly 60 seconds now completes in 128 seconds. Shipped in
0.8.16. - Every step of the loop ran on hardware I own.
1. A coding agent on a model I run myself
The fleet here is deliberately mixed: an NVIDIA box, a couple of Apple Silicon machines, an AMD Strix Halo machine with 128 GB of unified memory. The point of LLMKube is to serve models across all of it from one spec. So the obvious experiment was to stop renting a frontier model for agentic coding and point Claude Code at one of my own models, through the gateway, on my own hardware.
Short tasks worked. Then I gave it a real one, and after about twelve minutes of churning it died with an opaque API error. Not the model's fault. Mine. The agent had just walked straight into a bug in the operator the hard way.
2. The bug it surfaced
The ModelRouter compiles onto an Envoy AI Gateway, and when it generated the retry policy it hardcoded the per-attempt timeout:
"perRetry": map[string]interface{}{
"timeout": "60s", // <- hardcoded
}, Envoy applies that per-attempt timeout on top of the route-level request timeout. So no matter that the route was set to 30 minutes for long generations, every request was silently capped at 60 seconds, and the resulting 504 was not in the retry list, so it failed outright instead of retrying. Short turns finished under a minute and looked healthy. The first turn whose prefill grew past a minute fell off the cliff.
Clean, embarrassing, entirely self-inflicted. Exactly what Foreman is for.
3. Handing the bug to the fleet
Foreman is the agentic harness built into LLMKube: point it at an issue, a coder model works the fix in a sandbox under a set of rails, and a verification gate checks the result before anything counts. The choice worth dwelling on here is the hardware.
The coder was Qwopus3.6-27B-Coder, a dense 27B Qwen3.6 distill trained on Claude-Opus traces, 4-bit quantized, served through llama.cpp's TurboQuant fork on a consumer AMD Strix Halo machine (gfx1151, Vulkan). Not an H100. Not even an NVIDIA card. A mini-PC class box.
I filed the bug as #817, scoped the task to the one file and its test, and dispatched it. The coder found the function and made the right change: a small helper that derives the per-attempt timeout from the largest timeout actually configured across the router's rules and backends, with a named-constant fallback. It wrote tests and submitted GO. The gate ran gofmt, go vet, go build, a Linux-target golangci-lint, and the full suite in a clean room. All green.
4. Where it overreached, and the gate didn't catch it
Here is the part the "AI fixed my bug" posts skip.
The fix was correct, and its own two tests were good: one asserts the policy uses the configured 30-minute timeout, one asserts the 60-second fallback. But the model had also written two more tests for a completely unrelated function, default-route compilation, that the change never touched. They compiled. They passed. The gate had no opinion, because they were green, and green is all a gate measures.
Scope is not something a compiler or a linter can judge. That is a human call, and on review I made it: trimmed the two unrelated tests so the change was exactly the fix plus its own tests, and pushed the clean version.
This is why I build harnesses instead of waiting for a bigger model. A 27B model on a desktop is not Claude, and the trick is not pretending it is. What it does reliably is produce a correct, compiling, tested change inside a gate that guarantees a verified floor. What it cannot do is judge scope and intent. The harness gives you the floor for free; you bring the judgment. Trust the harness, not the model. That division of labor is what makes a model this size genuinely useful instead of a parlor trick.
5. Proof, on the cluster
A passing test is not a working system, so I verified before believing it.
I built the patched operator the way everything gets built here: in-cluster, a kaniko job on my own node, from the fix branch. Deployed it, watched the operator regenerate the gateway policy, and confirmed the per-attempt timeout flipped from 60s to 30m0s live. Then I fired the request that used to die at the one-minute mark:
HTTP 200 in 128.1s Before the fix, a 504 at exactly sixty seconds. After it, two minutes of clean generation. That is the difference between a green checkmark and a fixed bug, and it is worth the extra ten minutes every time. Shipped in 0.8.16.
6. Where this ran
Look at where every piece of that loop lived. The coder: a consumer AMD box on my desk. The build: a job on my own node. The deploy and verification: my own cluster. The model behind the original coding session: an Apple Silicon machine in the same house. Nothing touched a cloud GPU, a rented datacenter fleet, or a control plane sitting above someone else's clusters.
That is the thesis, not a footnote. Most tooling around self-hosted inference still assumes "self-hosted" means a rack of datacenter accelerators operated at scale. LLMKube assumes the opposite: that the frontier worth building for is making inference, and now agentic work on top of it, run well across the heterogeneous hardware you already own. Apple Silicon, consumer AMD, a couple of NVIDIA cards, an edge box at a remote site. One operator, one model spec, hardware you control.
A 27B model on a mini-PC fixing a real bug in the operator that serves it, verified on the cluster, is the smallest concrete proof of that. As of today it is also a true story.
Run it
- Quickstart: serve a model on your GPU, Apple Silicon, or AMD box in a few minutes.
- The fix: #817, PR #818.
- The cost math behind any of this: InferCost.
If you run it on hardware I haven't tried, I want to hear about it. A star helps more people find it.
Run Your Own Agentic Coding Setup
LLMKube manages model downloads, GPU scheduling, and exposes an OpenAI-compatible API across NVIDIA, Apple Silicon, and AMD. Point any agentic tool at it and keep every token on your network.