Foreman M4 install runbook: gate-role Agent on a verifier node
This runbook walks through standing up the verifier role of the
v0.1 Foreman pipeline. By the end, your cluster has a
running foreman-operator + foreman-agent, advertises itself as a
FleetNode with roles=[worker,verifier], and is ready for the V3
two-step demo (M4 Phase F) where a coder Agent on the M5 Max produces
a branch and the gate Agent on the verifier node runs make fmt vet lint test against it in a Kubernetes Job.
If you only want the coder role on the M5 Max, read M3 coder runbook instead. The two runbooks are
independent: install whichever roles you have hardware for, in any
order.
What M4 ships
- CRD changes:
Agent.spec.inferenceServiceRefis now optional; empty means “deterministic Agent, dispatch the first tool directly”.AgenticTask.spec.requiredCapability.rolesfilters FleetNodes by advertised role. - New tool:
run_gate_jobsubmits a Kubernetes Job that clones a branch of the fork, runs the foreman gate check suite (fmt + vet + lint + test + manifests + chart-crds + foreman-chart-crds), and surfacesGATE-PASS,GATE-FAIL, orGATE-ERRORin the task’s verdict. - New chart pieces in
charts/foreman/: operator + agent Deployments, RBAC for both, theforeman-gate-cachePVC, and avalues.yamlthat exposes the M4 install knobs. - New images:
ghcr.io/defilantech/llmkube-foreman-operatorandghcr.io/defilantech/llmkube-foreman-agent, published at the chart’s.Chart.AppVersiontag (track the foreman chart release for the current tag).
Prerequisites
A Kubernetes cluster you have admin on. The runbook assumes a CUDA-equipped Linux Kubernetes node but any linux/amd64 cluster works; the gate Agent is deterministic and does not need a GPU.
kubectlconfigured for the target cluster:kubectl get nodeshelmv3.13 or newer.LLMKube core installed at
>=0.7.10. Foreman’s chart depends on it for theinference.llmkube.devCRDs the operator’s RBAC references:helm repo add defilantech https://defilantech.github.io/llmkube helm repo update helm upgrade --install llmkube defilantech/llmkube --namespace llmkube-system --create-namespaceUpgrading from an M3-era cluster? M4 widens both Foreman CRDs:
Agent.spec.inferenceServiceRefbecomes optional, andAgenticTask.spec.requiredCapability.rolesis new. The chart’s CRD sync handles this automatically, but if you ever apply CRDs outside the chart (e.g. directly fromconfig/crd/bases/), the M3 runbook’s “re-apply all four CRDs” callout still applies: an older AgenticTask CRD silently rejectsspec.requiredCapability.rolesunder strict decode.
Step 1: install Foreman
From the published chart repo (recommended once you have llmkube installed via the prerequisite above):
helm upgrade --install foreman defilantech/foreman --namespace foreman-system --create-namespace --set agent.mode=native --set 'agent.roles={worker,verifier}' For dev installs from a local checkout:
cd /path/to/LLMKube
helm upgrade --install foreman charts/foreman --namespace foreman-system --create-namespace --set agent.mode=native --set 'agent.roles={worker,verifier}' Foreman is a sibling chart to llmkube, not a subchart: installing foreman never installs (or expects to install) llmkube. If you have not already installed llmkube core in the same cluster, do that first per the prerequisite section above.
Foreman milestone status
Foreman is shipped in stages alongside the LLMKube release train. Track your install against this milestone matrix so you know what to expect from a given version:
| LLMKube version | Foreman state |
|---|---|
<0.7.10 | Foreman scaffold only (CRDs, no operator / agent) |
0.7.10+ | M3 coder + M4 verifier installable; 2-step pipeline |
| 0.8.0 (planned) | M5 reviewer + M6 Workload planner — full v0.1 debut |
The 0.8.0 “Foreman’s debut” tag will cut once the M6 Workload
planner lands; at that point a single kubectl apply -f workload.yaml will drive the full coder + gate + reviewer pipeline end to end. The
installs documented here work today against any release 0.7.10+;
the workflows the v0.1 vision promises arrive at 0.8.0.
For a gate-only verifier node (the M4 install), nothing else is
required: the deterministic gate Agent never clones or pushes from
the foreman-agent Pod (the gate Job clones inside its own Pod). For
nodes that will also run coder-role Agents (M3 + later), add the agent.gitRemoteURL + agent.commitAuthorEmail knobs:
helm upgrade --install foreman charts/foreman --namespace foreman-system --create-namespace --set agent.mode=native --set 'agent.roles={worker,verifier}' --set agent.gitRemoteURL=https://github.com/Defilan/LLMKube.git --set agent.commitAuthorEmail=foreman-bot@example.com What that produces:
- A
foreman-operatorDeployment (replicas=1) hosting the Agent, AgenticTask, FleetNode, and Workload reconcilers. - A
foreman-agentDeployment (replicas=1) that registers itself as a FleetNode advertisingaccelerator=cuda,roles=[worker, verifier], and the in-cluster InferenceService set. - ClusterRoles + ClusterRoleBindings for both, scoped to the
foreman.llmkube.devAPI group + Jobs + ConfigMaps + pod logs. - A
foreman-gate-cachePVC (20 GiB ReadWriteOnce) that therun_gate_jobtool mounts into every gate Job so GOMODCACHE, GOCACHE, and XDG_DATA_HOME persist across runs. The first cold run takes ~10 min; subsequent warm runs land in ~2-3 min.
Watch the operator come up:
kubectl -n foreman-system logs -l app.kubernetes.io/component=operator -f The “Starting workers” line for each of the four reconcilers (Agent, AgenticTask, FleetNode, Workload) means the operator is healthy.
Step 2: confirm the agent registered itself
kubectl get fleetnodes Expected:
NAME READY ACCELERATOR RAM AVAILABLE AGE
<verifier-node> True cuda 64Gi 60Gi 1m Then dig into the advertised roles:
kubectl get fleetnode -o yaml | grep -A4 'spec:' | grep -E 'roles:|- worker|- verifier' Expected:
roles:
- worker
- verifier If you see only - worker, the Helm install missed the roles override. Re-run the helm upgrade with the --set 'agent.roles=…' line above.
Step 3: optional smoke task (stub executor)
Foreman ships with a stub executor that does nothing but sleep + write a synthetic Succeeded result. It’s the cheapest way to convince yourself the scheduler routes correctly to the new FleetNode.
# /tmp/foreman-smoke.yaml
apiVersion: foreman.llmkube.dev/v1alpha1
kind: AgenticTask
metadata:
name: verifier-smoke
namespace: default
spec:
kind: freeform
payload:
prompt: "smoke test"
requiredCapability:
roles: [verifier]
timeoutSeconds: 60 # Temporarily flip the agent to stub mode for this smoke task --
# the stub executor ignores the Agent CR entirely so we don't need
# one yet.
helm upgrade --reuse-values foreman defilantech/foreman -n foreman-system --set agent.mode=stub
kubectl apply -f /tmp/foreman-smoke.yaml
kubectl get agentictasks -w
# Pending -> Scheduled (assignedNode=<verifier-node>) -> Running
# -> Succeeded within ~15s.
# Flip back to native mode before Phase F.
helm upgrade --reuse-values foreman defilantech/foreman -n foreman-system --set agent.mode=native Step 4: applying the gate Agent CR
The actual gate Agent + the V3 two-step demo manifests land with Phase F. Once those merge, the workflow is:
kubectl apply -f config/foreman/agents/gate.yaml
kubectl apply -f examples/foreman/m4-two-step-demo.yaml See M4 verifier runbook (lands with Phase F) for the
full V3 demo.
Troubleshooting
fleetnodes list is empty: the foreman-agent Pod has not
registered yet. Check its log: kubectl -n foreman-system logs -l app.kubernetes.io/component=agent. The most
common cause is the ServiceAccount missing create on fleetnodes; re-run helm upgrade to refresh the RBAC.
ClusterRole permission denied on FleetNode write: the chart’s
agent ClusterRole grants create/update/patch on foreman.llmkube.dev/fleetnodes. If you customized the chart’s
RBAC, diff against charts/foreman/templates/agent-rbac.yaml and
confirm the rule is intact.
foreman-gate-cache PVC stuck Pending: the cluster has no
default StorageClass, or the chosen StorageClass cannot satisfy
ReadWriteOnce. Set agent.gateCache.storageClass explicitly or
disable the PVC: --set agent.gateCache.enabled=false (gate runs
will still work, just without the inter-run cache).
Agent Pod startup info-log “—git-remote-url is unset”: not
an error. Foreman v0.7.10+ logs this as INFO and proceeds; coder
tasks that actually need the URL fail per-task with reason GitRemoteURLNotConfigured. Deterministic Agents (e.g. the M4
gate) run fine without it. To suppress the log line on a node that
will also run coder tasks, set --set agent.gitRemoteURL=https://github.com/Defilan/LLMKube.git
--set agent.commitAuthorEmail=...at install time.
If you are running a pre-v0.7.10 foreman-agent image where this is
a hard os.Exit(1), upgrade. The startup check was relaxed in the
M4 work that landed in 0.7.10.
What this does NOT cover
- The actual gate Job’s resource sizing for a particular cluster.
The Phase B
run_gate_jobtool ships defaults (2/4 CPU, 4Gi/8Gi memory, 30 min activeDeadlineSeconds) that match the autofix pipeline’s tolerance. Tune via the run_gate_job tool’s Config in a future Phase G if your cluster has different headroom. - The Mac Studio reviewer Agent. That ships in M5.
- Cross-cluster setups where the coder and gate Agents run in different Kubernetes clusters. v0.1 assumes a single cluster.
- Production-grade observability for the gate Job. The pod log tail
surfaces in
Result.Extra.logTail; a full ServiceMonitor for the gate’s resource usage is a v0.2 follow-up.