Skip to content
Skip to documentation content
Browse documentation

Foreman model compatibility

The Foreman v0.1 native agent loop assumes the inference endpoint behind every Agent speaks OpenAI-style function calling: it emits structured tool_calls in chat-completions responses, the loop parses them, dispatches the named tool, and feeds the result back as a tool-role message keyed by tool_call_id. That assumption is true for most modern open-weights instruct models served via llama.cpp / llama-server / vLLM / mlx-server, but it isn’t universal.

This page is the calibrated table of what we’ve empirically validated. If a model isn’t here, that doesn’t mean it doesn’t work. It means we haven’t run it. Pull requests adding entries welcome.

How to read this table

  • Role: the Agent role we tested the model in (coder, reviewer, verifier).
  • Tool protocol: whether the model emits OAI-shaped tool_calls in llama.cpp / mlx-server / vLLM. ✓ means yes; ✗ means no.
  • Confabulation rate: subjective rating of how often the model’s terminal submit_result.extra fields contained text that wasn’t grounded in its own earlier tool calls. The harness reconciles known confabulation surfaces server-side (see #582 and the reconcileReviewer* helpers in pkg/foreman/agent/executor_native.go), so even a high-confab model is usable; the rate just describes how much work the reconciler does.
  • Notes: observed quirks worth documenting.

Tested matrix (v0.4 reviewer release)

ModelQuantHostRoleTool protocolConfabNotes
Qwen3.6-35B-A3B (Carnice MoE)Q8_0M5 Max 128GBcoderlowReference coder. Verified end-to-end on real LLMKube issues.
Qwen3.6-35B-A3BQ8_0M5 Max 128GBreviewerlowSame model serves as same-family reviewer. Catches Section G (godoc/code consistency) reliably.
Devstral-Small-2 24B-Instruct-2512Q6_KMac Studio 36GBreviewerhighTools all dispatch correctly; terminal submit_result.extra.issueAsk and filesTouched frequently confabulated on multi-file diffs. Harness reconciles both server-side.
Gemma 3 27B-itQ6_KMac Studio 36GBreviewern/aDoes not currently work. Emits tool invocations as Google’s native markdown \``tool_codeblocks rather than OAItool_calls. The loop sees zero tool_calls on turn 1 and force-terminates asModelMisunderstood. Tracked as [#589](https://github.com/defilantech/LLMKube/issues/589); fixed when thetoolProtocol` adapter work lands in v0.5.
Mistral-Small-3.2 24B-Instruct-2506Q6_KMac Studio 36GBreviewer⚠️n/aUnder investigation. First chat-completions request hangs indefinitely (llama-server health endpoint stays OK, CPU drops to 1.3%, no client-side timeout fires). May be a Metal-perf path issue specific to this model or an HTTP-streaming-shape issue. Tracked as #590.

How the harness handles confabulation

For reviewers whose tool protocol works but whose terminal payload is unreliable, the executor reconciles two fields server-side before the result is stored:

  • filesTouched is rewritten to the output of git diff --name-only main...HEAD in the workspace. The model’s original claim lands at filesTouchedClaimed for archaeology. Shipped in #584.
  • issueAsk is checked against the body the model fetched via the fetch_issue tool. If the claim is a literal substring of the body it’s marked verified; otherwise it’s archived at issueAskClaimed and rewritten with the first useful paragraph of the body. Shipped in #587.

A new boolean field issueAskVerified signals to downstream consumers whether the stored value came from the model verbatim or from the harness rewrite.

This means the verdict (which drives the cascade rule) is still based on the model’s reasoning, but the anchor fields downstream tools pivot on (which file did the diff touch? what does the issue actually ask for?) are harness-authoritative.

What v0.5 changes

The current Agent CRD shape doesn’t carry an explicit tool protocol field. That makes the Gemma 3 finding above a footgun: a user can apply an Agent CR pointing at a Gemma 3 InferenceService and watch every AgenticTask fail as ModelMisunderstood without the operator catching the misconfiguration ahead of time.

The v0.5 plan (#589) adds:

  • Agent.spec.toolProtocol: an enum (oai-function-calling, google-tool-code-blocks, anthropic-xml, text-marker) that declares which protocol shape the executor should expect.
  • Adapters in pkg/foreman/agent/oai/ that translate non-OAI protocols into the loop’s internal tool_calls shape.
  • A pre-flight validation on Agent reconcile that probes the referenced InferenceService and flags a misconfigured toolProtocol before any AgenticTask binds to it.

Until that lands, the practical advice is: stick to models in the “tested ✓” rows above for v0.4. The Qwen and Mistral families broadly work in llama.cpp’s OAI tool-calls implementation; the Gemma and (currently) Mistral-Small-3.2 paths don’t.

Contributing entries

If you run Foreman against a model not in this table, please file an issue or PR with:

  • Model + quantization + host hardware
  • Role you tested it in
  • Whether the loop reached submit_result (tool protocol ✓ / ✗)
  • A subjective confabulation rate if it did
  • Any reproducing notes for the failure modes you saw

The table grows the same way LLMKube’s hardware matrix does: people running real workloads on real hardware reporting what they actually saw.

LLMKube LLMKube

Kubernetes for Local LLMs. Deploy, manage, and scale AI inference workloads with production-grade orchestration.

© 2026 Defilan Technologies LLC

Community

Built for the Kubernetes and AI communities

LLMKube is not affiliated with or endorsed by the Cloud Native Computing Foundation or the Kubernetes project. Kubernetes® is a registered trademark of The Linux Foundation.