My Mac kernel-panicked, so I taught my code reviewer to stop trusting itself: 48 hours of local-model review integrity
Our earlier posts proved local agentic coding works: overnight batches, a clean end-to-end fix in 2 minutes 39 seconds, a 26B model shipping a real PR. The open question was always the reviewer. Our own controlled comparison in May showed local models rubber-stamping scope drift that a cloud frontier model caught every time, and the honest summary then was "fully local works, but the review judgment isn't there yet." This is the 48 hours that changed the shape of that sentence. Not because a local model suddenly got frontier judgment, but because we stopped needing it to. Trustworthy review is an architecture property, not a model property.
Act I: the panic
Monday night, mid-Foreman-testing, my 128 GB Mac hard-froze and rebooted. The panic report is one of the most honest log excerpts ever written:
panic: watchdog timeout: no checkins from watchdogd in 92 seconds
memoryPages.wired: 8,084,433 // x 16KB = 123.4 GB of 128 GB
memoryPages.free: 914 // 14 MB 123.4 of 128 GB wired. On Apple Silicon, model weights and KV cache live in Metal buffers that macOS can neither page out nor jetsam-kill. When they fill the machine there is no graceful OOM. The whole host stalls until the kernel watchdog pulls the plug.
The root cause was a memory admission check that failed open. The metal-agent estimates each model's footprint before spawning llama-server, but when it could not determine a model's size (fresh host, model not yet downloaded, status not yet populated), it logged memory estimation failed, proceeding without check and started the process anyway. Every unsized model sailed through, concurrently, until the host died.
The fix shipped the next morning as 0.8.2 (#641): admission now fails closed. A model that cannot be sized does not start; it gets SchedulingStatus: MemoryCheckFailed and a Kubernetes event explaining why. Before refusing, the agent tries one more sizing source: a HEAD request to the model's URL, so a not-yet-downloaded model is sized by its Content-Length and admission is decided before the download begins. An escape hatch (--memory-check-mode warn) restores the old behavior if you really want it. Verified live the same night: two correctly-sized models admitted from disk, two unsizeable ones refused with clear events instead of a panic. One of the refused ones turned out to have a broken source URL we did not know about. Fail-closed found that too.
The lesson that frames the rest of this post: the control plane must own the checks the workload cannot be trusted to pass. Hold that thought.
Act II: your reviewer's judgment is a coin flip, and that is fine
Foreman's review step asks an independent local model to check a coder's branch against what the issue actually asked. Back in May we built a trap for it: a branch whose diff is clean, plausible, well-tested code that simply does not address its issue. The issue asks for an RBAC-sync CI guard; the diff implements dependency handling in a controller. Scope drift. In our May comparison, a cloud frontier model caught it every time and our local models mostly did not.
Here is the result that reframed everything. Devstral, our production reviewer, caught that trap in May and then missed it twice in June. Same model, same branch, same prompt. Reviewer judgment is not weak. It is stochastic. You cannot prompt-engineer your way to a property the sampling temperature does not have.
So we stopped trying to make the model trustworthy and made the verdict trustworthy instead. Two rails, both in 0.8.3:
Rail 1: evidence enforcement (#645). The executor already ground-truthed the reviewer's claimed understanding of the issue (the issueAsk field) against the issue body the model itself fetched. That used to be observe-only. Now it routes: a reviewer that cannot prove it read the issue cannot approve a branch. An unverifiable approval demotes to NO-GO, which the workload controller already escalates to a bigger model. How necessary was this? Across every local model we have ever run (Devstral, Mellum2, North Mini, Qwen-family), the verbatim-quote verification has passed approximately never. Local models do not quote; they confabulate plausible paraphrases. One of them rejected a one-line docs change for "failing to implement mlx model format support," an ask that exists nowhere but in its own imagination. The rails caught that one too.
Rail 2: computable scope overlap (#648). The trap above does not actually require intelligence to catch. The issue names five concrete files. The diff touches zero of them. That intersection is a git diff --name-only and a conservative path extractor away, with no model judgment anywhere in the path. When the issue names files and the diff touches none of them, a GO demotes deterministically.
The first production run after the rails landed is the whole story in one log line: devstral missed the trap again (that is the coin flip), its quote failed verification, the scope check independently flagged zero overlap, and the final verdict came out NO-GO. The model was wrong and the system was right. That is the property you actually want.
Bonus: rail 2's first week of work included catching contamination in our own test fixtures. One of our "known good" exam branches had been silently recycled by a later automation run and no longer contained the fix it was supposed to. Three models' worth of prior verdicts had been judged against a moving target. A deterministic guard found what every model review, and the humans reading them, had missed.
Act III: the model that could only say no
Two days after Cohere released North Mini Code (30B MoE, 3B active, Apache 2.0, trained with RL on verifiable agentic tasks), it was serving on our reviewer node at 102 tokens per second. We built llama-server from the still-unmerged architecture-support PR rather than wait. Then we gave it the entrance exam: five frozen branches with known ground truth, including the trap.
Getting the exam to run was its own saga, and an honest scorecard of what thinking models do to an agentic harness. North Mini reasons before it acts, and our loop, like (we suspect) most homegrown agent loops, silently dropped the reasoning_content field. A turn spent thinking looked like an empty reply; the loop nudged; the model thought harder; repeat until dead. Supporting reasoning models properly took three PRs (#651, #652, #653): parse and aggregate the reasoning stream, map the new failure modes into the failure taxonomy, and keep stripped thinking turns from producing template-invalid empty messages that poison the rest of the conversation with HTTP 400s. The exam also flushed out four more latent bugs, from a CRD enum that could not store the one verdict the prompt mandated for honest failure, to stale endpoint registrations after a network blip. Seven real bugs, found by trying to onboard one model. The entrance exam is now load-bearing infrastructure.
And then the result. North Mini caught the trap. On its own judgment, with its reasoning enabled: REQUEST-CHANGES, with a finding that correctly cites the real ask ("the issue requires a CI guard that diffs the Helm chart's ClusterRole rules; this diff does not address it"). First local model to do it, ever, across every model and run we have tested. It caught the second drifted branch too. Rejection precision: perfect.
Then we ran the positive control, a genuinely clean branch that deserves an APPROVE, and discovered the strangest model behavior we have ever characterized. North Mini cannot approve. It reviews flawlessly, reaches the end, and then re-verifies. It re-fetches the issue (six times, in one run). It re-runs the same diff with different flags. Forty turns. Sixty turns. Forty-five minutes of wall clock.
We falsified hypotheses one at a time like a lab class: not the turn budget (raised it), not the wall clock (doubled it), not the impossible verbatim-quote mandate (softened it; the harness owns verification now anyway), not even the "when in doubt, choose REQUEST-CHANGES" clause (replaced it with explicit permission to approve and reassurance that residual uncertainty is normal). Six controlled runs on the same frozen branch. It never approved. The paralysis is intrinsic: this model, on vouch-shaped tasks, will not put its name on an endorsement.
In five completed reviews it produced zero wrong verdicts, and zero approvals. Its failures are all honest non-conclusion. Compare devstral: concludes every time, coin-flip precision. Two perfect complements.
What this adds up to
The naive plan was "find the local model with good judgment and put it in the reviewer slot." Every part of that plan dissolved on contact with data:
- Judgment is stochastic within a model (devstral: 1-for-3 on the same input).
- Evidence fields are confabulated by every model.
- The best judge we found cannot approve anything.
- The deterministic checks outperformed all of them on the failure mode that matters most.
What works is the architecture: a reviewer that always concludes (devstral), rails that catch fabricated evidence and computable drift with no model involved, a drift specialist whose rejections are gold and who is never asked to vouch (North Mini's actual calling), and escalation to a bigger model whenever the cheap layers disagree. Three different labs' models checking each other, deterministic verification underneath, on two Macs and a pair of consumer GPUs. Cloud inference spend across all of it: zero.
That is what "fully local" turns out to mean. Not a local model that matches a frontier model. A local system that does not need one.
The numbers
| Measurement | Result |
|---|---|
| Trap scoreboard, models unaided | cloud frontier 1/1 · North Mini (reasoning) 1/1 · Devstral 1/3 · Qwen 0/1 · Mellum2 0/1 |
| Trap scoreboard, system with rails | 3/3 |
| issueAsk verbatim verification pass rate, all local models | ~0% |
| North Mini decode on M5 Max | 102 tok/s (fastest 30B-class we have run) |
| North Mini review turns when rejecting | 14 to 32 turns, 2 to 4 minutes |
| North Mini approvals in 6 attempts up to 45 minutes | 0 |
| Harness bugs found by one model onboarding | 7 |
What shipped (0.8.2 + 0.8.3)
- Fail-closed memory admission with HEAD-probe sizing (#641)
- Reviewer evidence enforcement with verdict demotion (#645)
- Computable scope-overlap drift detection (#648)
- Hybrid-thinking model support end to end (#651, #652, #653)
- Reviewer contract rewritten so the harness owns evidence integrity (#655)
- Filed with reproductions: #649, #654, #656, #657
What's next
- The panel architecture: a drift-specialist contract for North Mini ("report absence of drift," never "vouch"), quorum across reviewers, escalation on disagreement.
- Escalation on INCOMPLETE, not just NO-GO, so an honest non-conclusion routes somewhere useful.
- The pipeline e2e test (#521), so the next seven bugs are found by CI instead of by an exam at 1am.
- An approval-head experiment: our fleet history is becoming a labeled dataset of verdicts with ground truth. A small fine-tune against it is the long game.
If you want to try any of this, the getting-started guide covers the operator, and the Foreman chart is opt-in on top. Everything in this post ran on hardware we own, and the same will be true of the follow-ups.