Foreman model compatibility
The Foreman v0.1 native agent loop assumes the inference endpoint
behind every Agent speaks OpenAI-style function calling: it
emits structured tool_calls in chat-completions responses, the
loop parses them, dispatches the named tool, and feeds the result
back as a tool-role message keyed by tool_call_id. That assumption
is true for most modern open-weights instruct models served via
llama.cpp / llama-server / vLLM / mlx-server, but it isn’t
universal.
This page is the calibrated table of what we’ve empirically validated. If a model isn’t here, that doesn’t mean it doesn’t work. It means we haven’t run it. Pull requests adding entries welcome.
How to read this table
- Role: the Agent role we tested the model in (coder, reviewer, verifier).
- Tool protocol: whether the model emits OAI-shaped
tool_callsin llama.cpp / mlx-server / vLLM. ✓ means yes; ✗ means no. - Confabulation rate: subjective rating of how often the
model’s terminal
submit_result.extrafields contained text that wasn’t grounded in its own earlier tool calls. The harness reconciles known confabulation surfaces server-side (see #582 and thereconcileReviewer*helpers inpkg/foreman/agent/executor_native.go), so even a high-confab model is usable; the rate just describes how much work the reconciler does. - Notes: observed quirks worth documenting.
Tested matrix (v0.4 reviewer release)
| Model | Quant | Host | Role | Tool protocol | Confab | Notes |
|---|---|---|---|---|---|---|
| Qwen3.6-35B-A3B (Carnice MoE) | Q8_0 | M5 Max 128GB | coder | ✓ | low | Reference coder. Verified end-to-end on real LLMKube issues. |
| Qwen3.6-35B-A3B | Q8_0 | M5 Max 128GB | reviewer | ✓ | low | Same model serves as same-family reviewer. Catches Section G (godoc/code consistency) reliably. |
| Devstral-Small-2 24B-Instruct-2512 | Q6_K | Mac Studio 36GB | reviewer | ✓ | high | Tools all dispatch correctly; terminal submit_result.extra.issueAsk and filesTouched frequently confabulated on multi-file diffs. Harness reconciles both server-side. |
| Gemma 3 27B-it | Q6_K | Mac Studio 36GB | reviewer | ✗ | n/a | Does not currently work. Emits tool invocations as Google’s native markdown \``tool_codeblocks rather than OAItool_calls. The loop sees zero tool_calls on turn 1 and force-terminates asModelMisunderstood. Tracked as [#589](https://github.com/defilantech/LLMKube/issues/589); fixed when thetoolProtocol` adapter work lands in v0.5. |
| Mistral-Small-3.2 24B-Instruct-2506 | Q6_K | Mac Studio 36GB | reviewer | ⚠️ | n/a | Under investigation. First chat-completions request hangs indefinitely (llama-server health endpoint stays OK, CPU drops to 1.3%, no client-side timeout fires). May be a Metal-perf path issue specific to this model or an HTTP-streaming-shape issue. Tracked as #590. |
How the harness handles confabulation
For reviewers whose tool protocol works but whose terminal payload is unreliable, the executor reconciles two fields server-side before the result is stored:
filesTouchedis rewritten to the output ofgit diff --name-only main...HEADin the workspace. The model’s original claim lands atfilesTouchedClaimedfor archaeology. Shipped in #584.issueAskis checked against the body the model fetched via thefetch_issuetool. If the claim is a literal substring of the body it’s marked verified; otherwise it’s archived atissueAskClaimedand rewritten with the first useful paragraph of the body. Shipped in #587.
A new boolean field issueAskVerified signals to downstream
consumers whether the stored value came from the model
verbatim or from the harness rewrite.
This means the verdict (which drives the cascade rule) is still based on the model’s reasoning, but the anchor fields downstream tools pivot on (which file did the diff touch? what does the issue actually ask for?) are harness-authoritative.
What v0.5 changes
The current Agent CRD shape doesn’t carry an explicit tool
protocol field. That makes the Gemma 3 finding above a footgun: a
user can apply an Agent CR pointing at a Gemma 3 InferenceService
and watch every AgenticTask fail as ModelMisunderstood without
the operator catching the misconfiguration ahead of time.
The v0.5 plan (#589) adds:
Agent.spec.toolProtocol: an enum (oai-function-calling,google-tool-code-blocks,anthropic-xml,text-marker) that declares which protocol shape the executor should expect.- Adapters in
pkg/foreman/agent/oai/that translate non-OAI protocols into the loop’s internaltool_callsshape. - A pre-flight validation on
Agentreconcile that probes the referenced InferenceService and flags a misconfiguredtoolProtocolbefore any AgenticTask binds to it.
Until that lands, the practical advice is: stick to models in the “tested ✓” rows above for v0.4. The Qwen and Mistral families broadly work in llama.cpp’s OAI tool-calls implementation; the Gemma and (currently) Mistral-Small-3.2 paths don’t.
Contributing entries
If you run Foreman against a model not in this table, please file an issue or PR with:
- Model + quantization + host hardware
- Role you tested it in
- Whether the loop reached
submit_result(tool protocol ✓ / ✗) - A subjective confabulation rate if it did
- Any reproducing notes for the failure modes you saw
The table grows the same way LLMKube’s hardware matrix does: people running real workloads on real hardware reporting what they actually saw.