Trust the harness, not the model: a weekend of local agents building their own guardrails

A local 27B coding model, running on hardware in my house, is a coin flip. Some runs it nails the fix in twenty minutes. Some runs it edits the wrong file, writes a test that passes no matter what the code does, and tells you it is done. The bet behind LLMKube's Foreman was never that I would find a local model good enough to trust. It was that I could build a harness I trust more than any single model's output. This weekend tested that bet harder than any benchmark could, because the harness spent the weekend building its own guardrails.

Here is the short version of what happened across 0.8.12 and 0.8.13. My local coder built three new gates for itself. One of them shipped with the exact flaw it was written to catch, and the review caught it. Three new contributors sent four clean pull requests while the machines worked. The same model ran on an AMD box and an Apple Silicon Mac, and the Mac quietly won a round nobody expected. And not one byte of any of it touched a cloud API.

The thesis, stated plainly

Trust the harness, not the model. A coding agent on a local model produces output of wildly variable quality, and no amount of prompt tuning makes a 27B as reliable as a frontier model. So Foreman does not ask the model to be reliable. It wraps the model in a pipeline that is: the coder works in a cloned workspace; a fast in-workspace gate runs gofmt, vet, build, lint, and the unit tests for the packages it touched; a reviewer reads the diff against the issue; and a clean-room Kubernetes Job re-runs the full suite before anything is allowed to call itself a GO. Around all of that sit deterministic rails: scope checks, edit-free-streak detection, repo-map context. The model is a stochastic component inside a system whose job is to make the system's verdict trustworthy even when the component is not.
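To make the ordering concrete, here is a rough sketch of a staged pipeline in Python. It is not Foreman's code: the stage functions are hypothetical stand-ins that inspect a string, and the "NO-GO" label is an assumed name for the opposite of a GO. The only point it illustrates is that the verdict belongs to the pipeline, and the first failing stage ends the run.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StageResult:
    name: str
    passed: bool


# A stage inspects a candidate diff (a plain string here) and reports pass/fail.
Stage = Callable[[str], StageResult]


def run_pipeline(diff: str, stages: List[Stage]) -> Tuple[str, List[StageResult]]:
    """Run stages in order and stop at the first failure."""
    results: List[StageResult] = []
    for stage in stages:
        result = stage(diff)
        results.append(result)
        if not result.passed:
            return "NO-GO", results  # assumed label; the post only names GO
    return "GO", results


# Hypothetical stand-ins; real stages would shell out to tools and a cluster.
def fast_gate(diff: str) -> StageResult:
    return StageResult("fast-gate", "syntax error" not in diff)


def reviewer(diff: str) -> StageResult:
    return StageResult("reviewer", "TODO" not in diff)


def clean_room(diff: str) -> StageResult:
    return StageResult("clean-room", True)


STAGES = [fast_gate, reviewer, clean_room]
```

Calling run_pipeline("TODO: finish", STAGES) stops at the reviewer stand-in and never reaches the clean-room stage, which is the behavior the prose describes: later, more expensive checks only run on work that survived the cheaper ones.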

The interesting question is never "is the model good." It is "does the harness catch the model when it is bad." This weekend gave me an unusually honest answer.

The audit that started it

It opened with a regression. I shipped 0.8.12 and rolled it across the fleet, and the metal agent on my Macs stopped serving. The cause was a change Foreman itself had authored a few days earlier: it made the agent honor a per-service runtime field, but the agent registered its llama.cpp backend under the key llama-server while every InferenceService in my fleet (and the in-cluster controller, and the CRD's own default) uses the canonical value llamacpp. The two halves of the codebase disagreed on a name. Backward-incompatible, and it had passed the gate, passed review, and shipped.
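The failure is easy to reproduce in miniature. The sketch below is an assumption-laden illustration in Python, not the agent's real registry or the fix that shipped: BackendRegistry and its alias mechanism are invented for the example. Only the two strings, llamacpp and llama-server, come from the incident.

```python
CANONICAL_RUNTIME = "llamacpp"  # the value the fleet, controller, and CRD default use


class BackendRegistry:
    """Hypothetical registry keyed by runtime name, with optional aliases."""

    def __init__(self):
        self._backends = {}
        self._aliases = {}

    def register(self, name, backend, aliases=()):
        self._backends[name] = backend
        for alias in aliases:
            self._aliases[alias] = name

    def resolve(self, runtime):
        # Map an alias to its canonical name, then look the backend up.
        name = self._aliases.get(runtime, runtime)
        if name not in self._backends:
            raise KeyError(f"no backend registered for runtime {runtime!r}")
        return self._backends[name]


# The regression in miniature: registered under one name, requested under another.
broken = BackendRegistry()
broken.register("llama-server", "llama.cpp backend")

# One possible shape of a compatible registry: canonical key, old name as alias.
fixed = BackendRegistry()
fixed.register(CANONICAL_RUNTIME, "llama.cpp backend", aliases=("llama-server",))
```

With the broken registry, resolve("llamacpp") raises KeyError, which is the "stopped serving" outcome. A test that looked up the canonical value would have gone red before the release did.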

That stung enough that I audited every PR Foreman had landed that weekend, looking for the same class of miss. I found a second one. A metrics change registered a time-to-first-token histogram and a request-error counter, complete with recording rules and a Grafana panel, that no production code ever emitted. The dashboard would have shown a confident, permanent zero.

Both bugs had the same shape, and it is the shape that should keep anyone running an agentic harness up at night: the tests passed without testing anything. The runtime change was tested with a made-up runtime value, never the real one the whole fleet uses. The metrics were "tested" by a unit test that incremented the counter itself and then asserted it went up. Self-confirming tests. The gate runs the tests and they are green, so the gate is happy. The gate never asked whether the tests would fail if the code were wrong. That is the harness's blind spot, and a stochastic model will find a blind spot every time you give it enough runs.
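The difference between a self-confirming test and one that bites fits in a few lines. This is a minimal Python sketch with hypothetical names (Counter, handle_request, and the two test functions are all invented here); it is not the project's metrics code. It shows a test that does the increment itself next to a test that drives the production path, and what each reports when the wiring is missing.

```python
class Counter:
    """Stand-in for a metrics counter."""

    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1


def handle_request(ok: bool, errors: Counter) -> str:
    """Production path: records an error when a request fails."""
    if not ok:
        errors.inc()
        return "error"
    return "ok"


def handler_without_wiring(ok: bool, errors: Counter) -> str:
    """The bug: same behavior, but the counter is never touched."""
    return "ok" if ok else "error"


def self_confirming_test() -> bool:
    c = Counter()
    c.inc()  # the test performs the increment itself
    return c.value == 1


def biting_test(handler) -> bool:
    c = Counter()
    handler(False, c)  # only the production path may move the counter
    return c.value == 1
```

self_confirming_test() passes no matter which handler ships, because it never calls one. biting_test passes for handle_request and fails for handler_without_wiring, so it is the only one of the two that carries information about the code.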

So the harness built its own guardrails

Every catch this weekend turned into a gate. I filed three issues for the exact failure classes the audit surfaced, and then I did the thing this whole project is about: I handed them back to Foreman and let the harness build the gates that make the harness better.

A scope guard. Score the issue's relevant files with the repo map, and reject a GO whose diff has zero overlap with them. This is the "you edited the wrong subsystem" catch, the one that used to need me watching.
A reviewer rubric. Two new checks the reviewer must apply: do the tests use the real values the system uses in production, not placeholders; and is every new metric, flag, or field actually wired into a production path, or only touched by tests. These are the #525 and #409 classes, written down as rules.
A bite check. The strongest one. A new or changed test must fail against the pre-change code. If it passes against both the old and the new code, it is not testing the change, and the gate rejects it as a non-biting test. This is the deterministic catch for the entire self-confirming class.
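The scope guard is the simplest of the three to sketch. The Python below assumes the repo map has already produced a list of relevant files for the issue; the function name and the file paths in the usage note are hypothetical, and the real guard presumably works from scores rather than a flat list.

```python
from typing import Iterable, List, Tuple


def scope_guard(relevant_files: Iterable[str],
                diff_files: Iterable[str]) -> Tuple[bool, List[str]]:
    """Return (allowed, overlap). A diff with zero overlap is rejected."""
    overlap = sorted(set(relevant_files) & set(diff_files))
    return bool(overlap), overlap
```

For example, scope_guard(["pkg/agent/runtime.go"], ["charts/values.yaml"]) returns (False, []), the "you edited the wrong subsystem" case. Note that this sketch also rejects when the relevant set is empty; a real guard would need an explicit policy for issues the repo map cannot place.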

The first two landed clean. The coder produced both on the first try, gate-verified, on a dense 27B model running over Vulkan on an AMD Strix Halo box on my desk. The reviewer rubric is even pleasingly self-referential: its own "is this wired up" change is, in fact, wired up. I checked.

The part where it ate its own tail

The bite check is where it got honest. I ran a deep review on the three branches before signing off on any of them, the same kind of adversarial review the harness runs: an isolated worktree, revert the implementation, re-run the new tests, and confirm they fail without the feature. The scope guard passed. The rubric passed. The bite check did not.

The gate built to reject non-biting tests shipped with four of its own six tests non-biting. They asserted the baseline-equivalent happy path, so they stayed green with the feature removed. The exact anti-pattern the feature exists to catch, in the feature's own test file. It also had a real correctness bug (it could not revert a brand-new production file, so it would falsely reject a legitimate new-file PR) and it had been built into the fast gate when it belonged in the clean-room Job.
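Both halves of that finding can be shown in a small model. The Python sketch below represents a workspace as a dict of path to content instead of a git checkout, and every name in it is hypothetical. It illustrates the revert rule the first version got wrong (a file that did not exist before the change is reverted by deleting it) and the three-way verdict a bite check has to produce.

```python
from typing import Dict, Iterable, Optional


def revert_plan(base_files: Dict[str, str],
                changed_prod_files: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each changed production file to its pre-change content.

    None means the file did not exist before the change.
    """
    return {path: base_files.get(path) for path in changed_prod_files}


def apply_revert(workspace: Dict[str, str],
                 plan: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Return a copy of the workspace with production files reverted."""
    reverted = dict(workspace)
    for path, content in plan.items():
        if content is None:
            reverted.pop(path, None)  # brand-new file: reverting means deleting
        else:
            reverted[path] = content
    return reverted


def bite_verdict(passes_on_new: bool, passes_on_old: bool) -> str:
    """Classify one new or changed test."""
    if not passes_on_new:
        return "failing"      # the change itself is broken
    if passes_on_old:
        return "non-biting"   # green with and without the change
    return "biting"           # red before, green after
```

The test files stay in place during the revert; only production files move back. A test that asserts the baseline-equivalent happy path lands in the "non-biting" branch, which is where four of the six tests in the first version of the gate would have ended up.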

I want to be clear that this is not a story about the harness failing. It is the opposite. The model produced a flawed gate, and the review (which is part of the harness) caught it, cold, with empirical evidence, before a line of it merged. That is the entire thesis demonstrated at its sharpest: even when the model writes the harness, you trust the harness over the model. I sharpened the issue with the specific fixes and sent it back for another run.

The rerun, for the record, died on turn 16 of an unexpected EOF on the model's streaming connection, a transient network blip. And the harness did the right thing again: it classified the run as an infrastructure error, marked it incomplete, and pushed nothing. No half-finished branch, no false GO. A blip is not a bug, and the system knew the difference. I confirmed the model server was healthy and re-dispatched it. That is the unglamorous reliability work that makes "leave it running overnight" an actual sentence I can say.
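The distinction between a blip and a bug can be sketched as a classification at the point where the run loop catches an exception. This Python example is an illustration under assumptions, not Foreman's run loop: the outcome labels, the RunResult fields, and the choice of which built-in exceptions count as infrastructure errors are all invented for the sketch.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: these built-in exception types stand in for "the infrastructure broke".
INFRA_ERRORS = (ConnectionError, EOFError, TimeoutError)


@dataclass
class RunResult:
    outcome: str   # "complete", "incomplete", or "out-of-turns"
    turn: int
    proceed: bool  # whether the run may continue on to the gates


def run_agent(step: Callable[[int], bool], max_turns: int) -> RunResult:
    """Call step(turn) until it returns True (done) or the turn budget runs out."""
    for turn in range(1, max_turns + 1):
        try:
            done = step(turn)
        except INFRA_ERRORS:
            # A dropped stream says nothing about the code, so nothing moves forward.
            return RunResult("incomplete", turn, proceed=False)
        if done:
            return RunResult("complete", turn, proceed=True)
    return RunResult("out-of-turns", max_turns, proceed=False)
```

A step function that raises EOFError on turn 16 yields an "incomplete" result with proceed set to False, so there is no branch to push and no verdict to misread. Any exception outside INFRA_ERRORS still propagates, which keeps genuine bugs loud.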

Two coders, one model, and a surprise

The fleet running all this is heterogeneous on purpose. The coder model is a dense 27B, and this weekend I had it serving on two very different machines: an AMD Strix Halo box over Vulkan, and an Apple Silicon M5 Max over Metal. Same model, same quant, two accelerators that share almost nothing.

I expected the dedicated AMD box to be the workhorse and the Mac to be the slower second lane. The early numbers say otherwise. Measured at a realistic context depth, the Mac's prompt-processing throughput came in well above what the Strix turns in on its stable configuration, and the Mac is stable where the Strix's fastest decode path falls over at long context. This is an early, deliberately unmatched read (different KV configs, not run side by side), and I will not put a clean number on it until I run them back to back. But the direction is the interesting part: on this workload the small Apple node is not the slow one. Heterogeneous-by-design keeps paying off in ways I do not predict in advance.

The other half of trusting the harness

Here is the part I did not expect to be writing about. While the machines worked through the weekend, the repository did something a repository with a pulse does: other people showed up. Two contributors I had not worked with before sent three pull requests against LLMKube's router and inference APIs: a default-route strategy that kills a class of boilerplate, topology-spread and affinity passthrough for the inference pods, and a revision-history-limit knob for the deployments. All three were clean. Complete tests, both CRD copies synced, docs updated, CI green, sign-off ceremony followed. I reviewed each one closely and the only notes I had were minor. Then, while I was literally drafting this post, a third contributor opened a fourth: a tidy fix for a Helm chart bug where setting modelCache.enabled: false did not actually disable the cache. Root-caused, tested, approved. Same story.

And it clicked that this is the same thesis. The gates and the review that make a coin-flip 27B trustworthy are the same gates and review that let a newcomer's pull request land clean. The harness is not an AI feature. It is the project's quality floor, and it does not care whether the diff came from a local model on my desk or a human across the internet. A good harness is what lets you say yes to contributions without holding your breath. Sometime in the middle of all this, someone dropped a note in our Discord: "Great project folks. You just saved me two hours of debugging vllm." That is the whole point, on both ends.

What I actually believe now

You do not need a frontier model on your own hardware to do real engineering work locally. You need a harness you trust more than any single model's output. Build that, and a 27B on a desktop becomes a useful, supervised coworker, one whose mistakes are caught by a system instead of by you reading every diff at midnight. Build that, and the same system becomes the thing that lets a community build on top of you.

The model produced a broken gate this weekend. The harness caught it. Three new contributors improved the project, and the harness vouched for their work the same way it vouches for the model's. That is not a contradiction to manage. That is the design working. Trust the harness, not the model.

LLMKube is Apache 2.0 and runs on Kubernetes with NVIDIA, Apple Silicon, and AMD. If you want to watch a local fleet do this, or send the fifth clean PR, the project is on GitHub and we are in Discord.