# macOS Metal Agent
LLMKube’s Metal Agent does something no other Kubernetes LLM tool
does: it lets your Mac Studio, Mac mini, or any Apple Silicon
machine serve as a first-class Kubernetes inference node — with
the same InferenceService CRD you use for NVIDIA GPUs and
without losing access to Metal because the workload is trapped in
a container.
The shape: a native macOS daemon (the agent) watches the
Kubernetes API for InferenceService resources marked `accelerator: metal`, spawns `llama-server` processes natively
with full Metal GPU access, and registers endpoints back into the
cluster so any pod can route traffic to your Mac over LAN /
Tailscale / WireGuard.
This guide gets you from a fresh Apple Silicon machine to a running Metal-accelerated InferenceService in about ten minutes.
## Prerequisites
- Apple Silicon Mac (M1 / M2 / M3 / M4 / M5). Intel Macs with Metal 2+ work but performance is materially worse.
- Access to a Kubernetes cluster — either remote (recommended; most production patterns put the Mac on the LAN as an inference node) or local (minikube / kind on Docker Desktop).
- kubectl configured against your cluster.
- Homebrew (or any equivalent way to install `llama.cpp` with Metal support).
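Before starting, it’s worth a ten-second check that `kubectl` really is pointed at the cluster you intend to use (standard kubectl commands, nothing LLMKube-specific):

```bash
# Confirm the active context reaches the cluster you expect
kubectl cluster-info
kubectl get nodes
```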
## Step 1: Install llama.cpp with Metal
```bash
brew install llama.cpp
```

Verify Metal is detected:

```bash
system_profiler SPDisplaysDataType | grep Metal
# expected: Metal Support: Metal 3
```
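The Homebrew formula ships the `llama-server` binary the agent spawns. Since a missing binary is the most common first-launch failure (see Troubleshooting), confirm it is on `PATH` now; the expected path below assumes Homebrew’s default Apple Silicon prefix:

```bash
which llama-server
# expected: /opt/homebrew/bin/llama-server
llama-server --version
```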
## Step 2: Install the LLMKube operator in your cluster

If you haven’t already, install the operator. The Metal Agent relies on the operator’s CRDs being installed and the controller running.
```bash
helm repo add llmkube https://defilantech.github.io/LLMKube
helm install llmkube llmkube/llmkube \
  -n llmkube-system --create-namespace
```

For OpenShift clusters, add `-f values-openshift.yaml` (see OpenShift install).
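Before moving on, a quick check that the controller is up and the CRDs landed; the group `inference.llmkube.dev` comes from the manifests in Step 5, and the exact CRD names shown are the conventional plural forms, so treat them as illustrative:

```bash
kubectl -n llmkube-system get pods
kubectl get crds | grep inference.llmkube.dev
# expect entries like: models.inference.llmkube.dev, inferenceservices.inference.llmkube.dev
```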
## Step 3: Install and start the Metal Agent
Clone the operator repo on your Mac and run the bundled installer:
```bash
git clone https://github.com/defilantech/LLMKube.git
cd LLMKube
make install-metal-agent
```

This builds the agent binary, installs it to `/usr/local/bin/llmkube-metal-agent`, drops a launchd plist into `~/Library/LaunchAgents/`, and starts the service. On a fresh Mac the whole thing takes about twenty seconds.
If you need to install manually (different binary path, different
launch system), see `deployment/macos/README.md` for the full plist and `launchctl` commands.
### Verify the agent is running
```bash
launchctl list | grep llmkube
# expected: <PID> 0 com.llmkube.metal-agent

curl -s http://localhost:9090/healthz
# expected: {"status":"ok"}

tail -f /tmp/llmkube-metal-agent.log
# leave this tab open; we'll watch it pick up the first InferenceService
```

## Step 4: Remote cluster setup
If your Kubernetes cluster runs on a different machine (a Linux
server or cloud cluster, as opposed to local kind / minikube), the
agent needs to register your Mac’s reachable IP so cluster pods
can route to llama-server on your Mac.
```bash
# Find your Mac's IP on the LAN
ipconfig getifaddr en0
# example: 192.168.1.50

# Or on Tailscale / WireGuard
tailscale status | head -2
```

Edit `~/Library/LaunchAgents/com.llmkube.metal-agent.plist` and
add to the `ProgramArguments` array:
```xml
<string>--host-ip</string>
<string>192.168.1.50</string>
```

Reload:

```bash
launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
launchctl load ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
```

Without `--host-ip` the agent registers `localhost` as the
endpoint, which only works when Kubernetes lives on the same Mac
(local minikube or Docker Desktop kind).
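For orientation, the edited array might look like the sketch below. The binary path matches the installer’s default from Step 3; any other flags your plist already carries stay in place, and the bundled plist is the source of truth:

```xml
<key>ProgramArguments</key>
<array>
  <string>/usr/local/bin/llmkube-metal-agent</string>
  <string>--host-ip</string>
  <string>192.168.1.50</string>
</array>
```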
## Step 5: Deploy a model with Metal
From any machine that can talk to your cluster:
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata: { name: phi-4-mini }
spec:
  source: https://huggingface.co/bartowski/phi-4-mini-instruct-GGUF/resolve/main/phi-4-mini-instruct-Q4_K_M.gguf
  format: gguf
  hardware:
    accelerator: metal
---
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata: { name: phi-4-mini }
spec:
  modelRef: phi-4-mini
```

```bash
kubectl apply -f phi-4-mini.yaml
kubectl get inferenceservice phi-4-mini -w
# wait for PHASE=Ready
```

The agent’s log should show:
"msg":"starting inference service","name":"phi-4-mini"
"msg":"registered endpoint","hostIP":"192.168.1.50","port":<allocated>
"msg":"started inference service","name":"phi-4-mini","pid":<llama-server-pid> Query the model from anywhere in the cluster:
```bash
kubectl port-forward svc/phi-4-mini 8080:8080 &
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"phi-4-mini","messages":[{"role":"user","content":"hi"}]}'
```

## Memory budgets
The agent estimates each model’s memory cost (weights + KV cache +
overhead) before spawning `llama-server`. If the model won’t fit
in the configured budget, the agent refuses to start it and marks
the InferenceService with `status.schedulingStatus: InsufficientMemory`.
Defaults are tuned by total system RAM:
| Total RAM | Default fraction | Budget |
|---|---|---|
| 16 GB | 67% | ~10.7 GB |
| 36 GB | 67% | ~24.1 GB |
| 48 GB | 75% | 36 GB |
| 64 GB | 75% | 48 GB |
| 128 GB | 90% | 115 GB |
Override the fraction with `--memory-fraction 0.9` for a dedicated
inference machine, or `0.5` if the Mac is also your daily-driver
workstation. Add the flag to the launchd plist’s `ProgramArguments` the same way as `--host-ip`.
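To see where a given machine lands in that table, the arithmetic is just total RAM times the fraction; a quick check from the Mac itself, assuming only macOS’s standard `sysctl` key for total memory:

```bash
# Total RAM in GiB, then the budget at a given fraction (example: 0.75)
total_gib=$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
echo "total: ${total_gib} GiB, budget at 0.75: $(echo "$total_gib * 0.75" | bc) GiB"
```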
The agent also implements memory-pressure protection: if
macOS reports critical memory pressure, the agent can evict the
lowest-priority running InferenceService and refuse to spawn new
ones until pressure normalizes. See the Memory-pressure protection guide for tuning.
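To see what the agent is reacting to, you can ask macOS directly. Both of these are stock macOS tools, independent of LLMKube; the conventional mapping for the sysctl value is 1 = normal, 2 = warn, 4 = critical, but verify on your OS version:

```bash
# Kernel-reported pressure level
sysctl kern.memorystatus_vm_pressure_level
# Human-readable summary, including the free-memory percentage
memory_pressure | tail -n 3
```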
## ModelRouter integration
The Metal Agent’s InferenceServices are first-class targets for
the ModelRouter CRD. Reference them by name like any other
local backend:
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: ModelRouter
metadata: { name: hybrid-router }
spec:
  backends:
    - name: local-mac
      inferenceServiceRef: { name: phi-4-mini }  # the InferenceService above
      tier: local
      capabilities: [chat]
    - name: cloud-opus
      external:
        provider: anthropic
        model: claude-opus-4-7
        credentialsSecretRef: { name: anthropic-key }
      tier: cloud
  rules:
    - name: pii-stays-on-mac
      match: { dataClassification: [pii] }
      route: { backends: [local-mac] }
      failClosed: true
  defaultRoute: local-mac
```

The router-proxy pod (which the controller schedules in the
cluster, not on the Mac) dials the agent-registered endpoint
when the rule resolves to local-mac. From the router’s
perspective the Mac-served backend is indistinguishable from a
container-served one — same InferenceServiceRef shape, same
fail-closed semantics, same per-rule timeout budgets.
See the ModelRouter concept doc for
the full policy model.
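A minimal smoke test, assuming the controller exposes the router-proxy through a Service named after the ModelRouter and on the same port the InferenceService above uses (both assumptions; check `kubectl get svc` for the real name and port):

```bash
kubectl apply -f hybrid-router.yaml
kubectl get modelrouter hybrid-router
# send a default-route request through the router (service name/port assumed)
kubectl port-forward svc/hybrid-router 8081:8080 &
curl -sS http://localhost:8081/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"phi-4-mini","messages":[{"role":"user","content":"hi"}]}'
```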
## Cross-cluster fleet shape
Heterogeneous clusters are the strongest pattern: NVIDIA nodes in
a cloud for heavy workloads, Mac Studios on-prem for
low-latency / sensitive work, all managed by the same controller
with the same CRDs. The agent makes the Mac visible to the
controller exactly like a Linux node visible to a Deployment reconciler, just with `accelerator: metal` instead of `accelerator: cuda` on the Model.
Operationally:
- Put the Mac on the same VPN / Tailscale tailnet as your cluster’s worker nodes.
- Set `--host-ip` to the Mac’s address on that network.
- The controller routes all `accelerator: metal` InferenceServices to whatever agent is registered for that endpoint.
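On the Mac, the tailnet address to use for `--host-ip` is one command away (`tailscale ip` is part of the standard Tailscale CLI):

```bash
# The Mac's IPv4 address on the tailnet
tailscale ip -4
# use this address as --host-ip in the plist, exactly as in Step 4
```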
## Optional: Apple Silicon power metrics
For InferCost (LLMKube’s
companion FinOps project) per-token cost attribution on Apple
Silicon, the agent can publish CPU / GPU / ANE / Combined power
gauges sourced from `powermetrics`. This is disabled by default because `powermetrics` requires root.
Enable in three steps:
1. Install the bundled NOPASSWD sudoers fragment, which pins both the binary path and the argument vector so the grant is the narrowest possible:

   ```bash
   make install-powermetrics-sudo
   ```

2. Add `--apple-power-enabled` to the launchd plist’s `ProgramArguments` array.
3. Reload the agent.
The four gauges exposed: `llmkube_metal_agent_apple_power_combined_watts`, `llmkube_metal_agent_apple_power_gpu_watts`, `llmkube_metal_agent_apple_power_cpu_watts`, `llmkube_metal_agent_apple_power_ane_watts`.
See `deployment/macos/README.md` for the full sudoers setup and a manual install path that lets you
inspect each step before running it.
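Once enabled and reloaded, the gauges should be visible on the agent’s local listener. This assumes metrics are served at the conventional `/metrics` path on the same port 9090 the health check uses:

```bash
curl -s http://localhost:9090/metrics | grep llmkube_metal_agent_apple_power
```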
## Troubleshooting
### Agent process not running after install

Check `/tmp/llmkube-metal-agent.log` (the `StandardOutPath`/`StandardErrorPath` configured in the bundled
launchd plist) for the first-launch error. Most common cause: `llama-server` not on `PATH` or at the configured `--llama-server` path.
### Pods can’t reach llama-server (remote cluster)

The agent registered `localhost`. Confirm `--host-ip` is set in
the plist and points at an address reachable from your cluster’s
worker nodes:

```bash
# From a worker node:
ping <your-mac-ip>
curl http://<your-mac-ip>:<allocated-port>/v1/models
```

If those work but routing through the cluster Service fails, check the registered Endpoints object:

```bash
kubectl get endpoints <inferenceservice-name>
# expect: subsets[0].addresses[0].ip = your Mac's --host-ip
```

### InferenceService stuck in InsufficientMemory

The agent’s pre-flight estimator says the model won’t fit. Either
shrink the model (use a smaller quantization), reduce the context
size in the InferenceService spec, or raise `--memory-fraction`. If the Mac is the only one in the cluster and
is a dedicated inference machine, `0.9` is reasonable.
### macOS firewall prompt on first run

The Metal Agent listens on `127.0.0.1:9090` for its own
health/metrics, and llama-server listens on an allocated port
for inbound inference. macOS will prompt to allow incoming
connections on first run. Allow them.
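On a headless Mac there is no one to click the prompt. The stock `socketfilterfw` tool can pre-approve the binaries; the paths below assume the default install locations from Steps 1 and 3:

```bash
# Add each binary to the application firewall and allow inbound connections
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /usr/local/bin/llmkube-metal-agent
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --unblockapp /usr/local/bin/llmkube-metal-agent
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /opt/homebrew/bin/llama-server
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --unblockapp /opt/homebrew/bin/llama-server
```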
### Agent log shows `replicas=0; stopping process` unexpectedly

A controller-side reconcile saw `spec.replicas=0` on the InferenceService. Check whether something scaled it down
(another operator, a Helm upgrade reverting your spec, an
operator-managed Argo CD app pulling a stale value).
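Two quick checks from the cluster side (standard kubectl; the `spec.replicas` path comes from the log message above):

```bash
kubectl get inferenceservice phi-4-mini -o jsonpath='{.spec.replicas}{"\n"}'
# if it prints 0, look for who set it:
kubectl get events --field-selector involvedObject.name=phi-4-mini
```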
## Uninstall
```bash
cd /path/to/LLMKube-checkout
make uninstall-metal-agent
```

That tears down the launchd service, removes the binary from `/usr/local/bin`, and deletes the plist. Model weights downloaded
into the agent’s `--model-store` path stay on disk (the agent
doesn’t clean those up; remove manually if needed).
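If you want to find those weights, read the flag’s value out of the plist while it still exists, i.e. before running `make uninstall-metal-agent`. `PlistBuddy` is a stock macOS tool, and the flag may simply be absent if you never set it:

```bash
# Print the agent's arguments and show the value following --model-store, if any
/usr/libexec/PlistBuddy -c 'Print :ProgramArguments' \
  ~/Library/LaunchAgents/com.llmkube.metal-agent.plist | grep -A1 -- --model-store
```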
## Reference
- `deployment/macos/README.md` — full reference including manual install, launchd plist tuning, Prometheus metrics enumeration, sudoers fragment internals
- Memory-pressure protection — eviction tuning and the InferenceService priority field
- Model Router — policy-aware routing layer above Metal-served InferenceServices
- Air-gapped install — combining Metal serving with offline / private-registry installs