Introducing Foreman: a Kubernetes-native orchestrator for your local LLM fleet (LLMKube 0.8.0)
0.8.0 is the release where LLMKube grows a control plane for agentic work. It introduces Foreman, an opt-in add-on that dispatches coder, verifier, and reviewer agents across a heterogeneous fleet of locally-hosted LLM nodes, and routes each task to the node whose hardware actually fits the job. The way we know it works is the only way we trust: Foreman authored its own debut pull requests against this repository, including the docs nudge you'll see linked below. Here's what landed.
What Foreman is
Foreman is four Kubernetes custom resources, a capability-aware scheduler, and a native Go agent loop. You declare a Workload ("fix these eight issues in this repo"). The reconciler expands that into a pipeline of AgenticTasks: one for the coder agent, one for the verifier (gate), one or more for reviewer agents. Each task references an Agent, the reusable role definition that names a system prompt, a tool whitelist, and an inference endpoint. The scheduler matches each task to a FleetNode whose advertised capability (accelerator family, RAM, context window, role) satisfies the Agent's requirement, and the node's local agent claims it and runs the loop.
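To make the matching step concrete, here is a minimal sketch of a fixed-rule capability check. It is not Foreman's scheduler code: the struct names, field names, and every numeric value below are hypothetical, chosen only to mirror the attributes listed above (accelerator family, RAM, context window, role).

```go
package main

import "fmt"

// Capability is a hypothetical stand-in for what a FleetNode advertises.
// The real CRD schema may look quite different.
type Capability struct {
	Accelerator   string   // e.g. "metal" or "cuda"
	RAMGiB        int      // memory available to the model
	ContextWindow int      // tokens
	Roles         []string // roles the node is willing to run
}

// FleetNode pairs a node name with its advertised capability.
type FleetNode struct {
	Name string
	Cap  Capability
}

// Requirement is a hypothetical stand-in for what an Agent asks for.
type Requirement struct {
	Accelerator      string // empty means "any accelerator"
	MinRAMGiB        int
	MinContextWindow int
	Role             string
}

// satisfies reports whether a capability meets every part of a requirement.
func satisfies(c Capability, r Requirement) bool {
	if r.Accelerator != "" && r.Accelerator != c.Accelerator {
		return false
	}
	if c.RAMGiB < r.MinRAMGiB || c.ContextWindow < r.MinContextWindow {
		return false
	}
	for _, role := range c.Roles {
		if role == r.Role {
			return true
		}
	}
	return false
}

// pickNode returns the first node in list order that satisfies the requirement.
func pickNode(nodes []FleetNode, r Requirement) (FleetNode, bool) {
	for _, n := range nodes {
		if satisfies(n.Cap, r) {
			return n, true
		}
	}
	return FleetNode{}, false
}

func main() {
	// Illustrative values only; node names and numbers are assumptions.
	nodes := []FleetNode{
		{Name: "verifier-node", Cap: Capability{Accelerator: "cuda", RAMGiB: 32, ContextWindow: 8192, Roles: []string{"verifier"}}},
		{Name: "coder-node", Cap: Capability{Accelerator: "metal", RAMGiB: 128, ContextWindow: 65536, Roles: []string{"coder"}}},
	}
	req := Requirement{Accelerator: "metal", MinRAMGiB: 64, MinContextWindow: 32768, Role: "coder"}
	if n, ok := pickNode(nodes, req); ok {
		fmt.Println("dispatch to", n.Name) // prints "dispatch to coder-node"
	}
}
```

First-match selection keeps the sketch short; a real scheduler would presumably also weigh load and availability before a node claims a task.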
The loop is OpenAI function-calling against your local llama-server / mlx-server / vLLM endpoint, in-process Go, no subprocess. Tool calls are structured tool_calls, not text markers. Every turn carries an OpenTelemetry span. When the coder agent's submit_result says GO, Foreman commits the workspace with a DCO sign-off and pushes the branch to your fork. The verifier picks the branch up and runs your gate (in our case make fmt vet lint test manifests chart-crds) as a Kubernetes Job. The reviewers read the diff against the issue body and score it. Verdicts cascade to the parent Workload, and the fork branches sit there ready for a human to inspect or open as upstream PRs.
If you only want LLMKube for serving local models, the operator and CRDs you already use are unchanged. Foreman ships as a separate Helm chart with its own API group (foreman.llmkube.dev) and its own controller and node agent binaries. You install it on top of LLMKube when you're ready for the pipeline shape; you ignore it otherwise.
The dogfood: Foreman shipped real PRs against this repository
The headline test for any orchestrator is whether it can actually orchestrate. So Foreman was pointed at LLMKube's own issue tracker. Over Memorial Day weekend it ran six batches against curated issues, on a three-node fleet:
- Coder node: MacBook Pro M5 Max (Apple Silicon, 128 GB unified memory) running qwen36-35b-carnice-mtp via the metal-agent.
- Verifier node: ShadowStack, dual NVIDIA 5060 Ti Linux/K8s, running the gate job in a CUDA-tagged container.
- Reviewer node: Mac Studio M4 Max (Apple Silicon, 36 GB) running a different model family for cross-reviewer diversity.
Two contributions ended up upstream. PR #508 added the make lint-all target back on May 21, the V3 first-demo run. The branch that opens this release, PR #588, is the docs nudge that points contributors at that target from AGENTS.md and CONTRIBUTING.md. The commit is DCO-signed as Foreman Bot, the gate passed cleanly, two reviewer agents on two different hosts approved it, and the branch was opened as the upstream PR you see linked.
The pipeline that produced that two-file change spans three machines, three model families, and the entire Workload → AgenticTask → Agent → FleetNode declarative graph. The marginal cost of a similar-sized diff is in the ballpark of one cent of electricity, with no API quota involved. The interesting part isn't the pennies; it's that the same machinery handles the next eight issues without you doing anything in between.
The same dogfood weekend surfaced real bugs in Foreman itself. The reviewer agents flagged that one model family was confidently inventing filesTouched and issueAsk fields in its terminal payload even when its earlier tool calls returned correct data. We shipped #584 and #587 to make the harness authoritative on both fields (the executor now overwrites them from git diff and the fetch_issue tool result respectively), and #581 to swap the reviewer's gh issue view shell-out for an in-process tool that uses the agent's own GitHub token. Same dogfooding pattern as 0.7.9 with the autoscaling tutorial: walking the path on real hardware found the things we needed to fix to make the path actually work.
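The "harness is authoritative" idea can be sketched in a few lines: whatever the model claims in its terminal payload, the list of touched files is replaced with one derived from the diff. The Result struct and both function names below are hypothetical. The only assumption about git is that the input string is the output of `git diff --name-only`, which prints one path per line.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Result is a hypothetical terminal payload. Only the idea of a
// "files touched" field comes from the text above.
type Result struct {
	Verdict      string
	FilesTouched []string
}

// filesFromDiff turns `git diff --name-only` output into a sorted,
// de-duplicated list of paths, skipping blank lines.
func filesFromDiff(nameOnly string) []string {
	seen := map[string]bool{}
	var files []string
	for _, line := range strings.Split(nameOnly, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || seen[line] {
			continue
		}
		seen[line] = true
		files = append(files, line)
	}
	sort.Strings(files)
	return files
}

// makeAuthoritative discards the model's claim and uses the diff instead.
func makeAuthoritative(r Result, nameOnly string) Result {
	r.FilesTouched = filesFromDiff(nameOnly)
	return r
}

func main() {
	// The model's claim here is invented to show a mismatch.
	claimed := Result{Verdict: "GO", FilesTouched: []string{"README.md"}}
	diffOutput := "AGENTS.md\nCONTRIBUTING.md\n"

	fixed := makeAuthoritative(claimed, diffOutput)
	fmt.Println(fixed.FilesTouched) // prints "[AGENTS.md CONTRIBUTING.md]"
}
```

The design choice is that anything the harness can observe directly is never taken from the model's self-report.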
Who Foreman is for
Foreman is infrastructure for shops with more than one machine and a reason to keep agentic work local. The target profile:
- You have on-prem GPU or Apple Silicon already paid for. A few NVIDIA boxes, a couple of Mac Studios, a small Apple Silicon edge fleet. You'd like that hardware to do agentic batch work overnight without going to a cloud API. Foreman is what makes "fleet" a real noun instead of a list of hosts.
- You have a sovereignty or compliance constraint. Regulated industries (healthcare, defense, financial services), public-sector procurement, customers whose contracts forbid data egress, or just a strong preference for not paying per-call cloud pricing on agentic workloads. Foreman keeps the data, the model weights, and the decisions on hardware you control.
- You already use Kubernetes. Foreman is K8s-native: CRDs, controllers, Helm chart, RBAC, OTel. It slots into your existing observability and policy patterns rather than fighting them.
- You want capability-aware dispatch, not single-host orchestration. Agentic frameworks like CrewAI and LangGraph are great inside one process. Foreman lives a layer below: it routes which physical machine each step runs on, based on what hardware each step needs.
Foreman isn't a replacement for Cursor or aider for individual developers writing code at their desk; those are excellent tools, and they're optimised for a different shape of work. Foreman is for batch agentic workloads that span a fleet: overnight bug-fix runs, scheduled reviews, repeated evals, the on-prem analog of the kind of work that's currently sitting in someone's cloud bill.
What v0.1 deliberately doesn't ship
Foreman v0.1 is the foundation, not the finished platform. Here are a few capabilities we know people will ask for, and that we deliberately punted to keep the v0.1 surface honest:
- Linear pipelines only. The pipeline shape is coder → verifier → reviewers, with a one-step dependsOn chain. Full DAGs (parallel branches, joins, fan-out across competing candidates) are v0.2 territory.
- No best-of-N or jury selection. The reviewers score the coder's diff but don't pick between competing coder candidates. That selection step lands in v0.2 as a separate role.
- No autonomous planner. The current planner is a stub: you hand it an explicit list of issues, or an explicit pipeline. The LLM-driven planner that decomposes a free-text Workload intent into a pipeline ships in v0.2; v0.1 keeps the CRD shape it'll plug into.
- No self-improving routing. The capability matcher today is fixed rules. The AgentScore corpus that biases future dispatch based on past outcomes is on the roadmap; v0.1 just records the data.
- Model-tool-protocol compatibility is implicit, not declared. Foreman currently assumes every inference endpoint speaks OpenAI tool_calls. Dogfooding turned up one model family that emits tools as markdown code blocks instead, which doesn't currently work with the loop. The Agent CR will gain a toolProtocol field and the executor will gain adapters in v0.2; for now we publish a calibrated model-compatibility table in the docs.
The v0.1 CRD shape was designed so each of those additions is a non-breaking extension. Pinning the foundation is the work of this release; everything above is what we build on it.
Welcome, adebrie
This release also lands Intel GPU support from a first-time contributor. adebrie opened PR #557, which adds an Intel oneAPI / SYCL path across the controller and the metal-agent. It's a real addition to the heterogeneous-fleet story: LLMKube already covered NVIDIA CUDA and Apple Metal, and Intel Arc / Data Center GPU Max now gets a viable on-prem deployment target on top of the same operator. It's the kind of correctly scoped infrastructure contribution that fits the LLMKube shape perfectly. Thank you, adebrie, and welcome.
Upgrade notes
LLMKube 0.8.0 is backward-compatible with the 0.7.x line. The core operator, CRDs (inference.llmkube.dev), and metal-agent surfaces are unchanged. Foreman is opt-in:
- Don't want Foreman? Don't install its chart. Your existing Model and InferenceService resources behave the same as in 0.7.x. No new namespaces, no new CRDs registered.
- Trying Foreman? Install the foreman Helm chart on top of an existing LLMKube install. It registers four new CRDs (Workload, AgenticTask, Agent, FleetNode) under the foreman.llmkube.dev API group, runs a separate controller and node-agent Deployment, and depends on LLMKube core for the underlying InferenceServices its agents call.
- Intel GPU users: the oneAPI / SYCL path is wired through the standard hardware.accelerator field. See the new section in the deployment docs.
Install
# Core LLMKube (unchanged from 0.7.x)
helm repo add llmkube https://defilantech.github.io/LLMKube
helm repo update
helm upgrade --install llmkube llmkube/llmkube \
--namespace llmkube-system --create-namespace
# Foreman add-on (new in 0.8.0)
helm install foreman llmkube/foreman \
--namespace foreman-system --create-namespace
# CLI (macOS / Linux)
brew upgrade defilantech/tap/llmkube

Full changelog on the v0.8.0 release page. Foreman docs live under /docs/foreman, including the install runbook, CRD reference, and the model-compatibility table.
What's next
The v0.2 work above (DAGs, best-of-N, autonomous planner, tool-protocol adapters) is the next layer. We'll keep dogfooding Foreman against this repository: the more it runs against real issues, the faster the rough edges surface and the more confident we get in publishing the patterns that work. The model-compatibility table in the docs is the first artifact of that pattern, and we'll add to it as we test new models.
If your shop fits the profile above (on-prem hardware, sovereignty constraint, K8s already in production), we'd love to hear about your fleet. File issues, open a thread on Discord, or just star the repo. And if you'd like Foreman to fix a small issue against your own project as a smoke test of the pattern: that's exactly the kind of feedback that shaped this release.