
macOS Metal Agent

LLMKube’s Metal Agent does something no other Kubernetes LLM tool does: it lets your Mac Studio, Mac mini, or any other Apple Silicon machine serve as a first-class Kubernetes inference node, with the same InferenceService CRD you use for NVIDIA GPUs and without losing access to Metal because the workload is trapped in a container.

The shape: a native macOS daemon (the agent) watches the Kubernetes API for InferenceService resources marked accelerator: metal, spawns llama-server processes natively with full Metal GPU access, and registers endpoints back into the cluster so any pod can route traffic to your Mac over LAN / Tailscale / WireGuard.

This guide gets you from a fresh Apple Silicon machine to a running Metal-accelerated InferenceService in about ten minutes.

Prerequisites

  • Apple Silicon Mac (M1 / M2 / M3 / M4 / M5). Intel Macs with Metal 2+ work but performance is materially worse.
  • Access to a Kubernetes cluster — either remote (recommended; most production patterns put the Mac on the LAN as an inference node) or local (minikube / kind on Docker Desktop).
  • kubectl configured against your cluster.
  • Homebrew (or any equivalent way to install llama.cpp with Metal support).

Step 1: Install llama.cpp with Metal

brew install llama.cpp

Verify Metal is detected:

system_profiler SPDisplaysDataType | grep Metal
# expected: Metal Support: Metal 3
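
The agent spawns llama-server directly, so it is worth confirming the binary Homebrew installed is actually on your PATH before going further. A plain sanity check (the path shown is the default Apple Silicon Homebrew prefix; yours may differ):

which llama-server
# expected: /opt/homebrew/bin/llama-server
llama-server --version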

Step 2: Install the LLMKube operator in your cluster

If you haven’t already, install the operator. The Metal Agent relies on the operator’s CRDs being installed and the controller running.

helm repo add llmkube https://defilantech.github.io/LLMKube
helm install llmkube llmkube/llmkube \
  -n llmkube-system --create-namespace

For OpenShift clusters, add -f values-openshift.yaml (see OpenShift install).
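
Before moving on, you can confirm the CRDs the agent depends on are present. The exact CRD names are inferred from the inference.llmkube.dev API group used throughout this guide, so a grep on the group is the simplest check:

kubectl get crds | grep inference.llmkube.dev
# expect to see at least models, inferenceservices, and modelrouters in that group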

Step 3: Install and start the Metal Agent

Clone the operator repo on your Mac and run the bundled installer:

git clone https://github.com/defilantech/LLMKube.git
cd LLMKube
make install-metal-agent

This builds the agent binary, installs it to /usr/local/bin/llmkube-metal-agent, drops a launchd plist into ~/Library/LaunchAgents/, and starts the service. On a fresh Mac the whole sequence takes about twenty seconds.

If you need to install manually (different binary path, different launch system), see the deployment/macos/README.md for the full plist and launchctl commands.

Verify the agent is running

launchctl list | grep llmkube
# expected: <PID>   0   com.llmkube.metal-agent

curl -s http://localhost:9090/healthz
# expected: {"status":"ok"}

tail -f /tmp/llmkube-metal-agent.log
# leave this tab open; we'll watch it pick up the first InferenceService

Step 4: Remote cluster setup

If your Kubernetes cluster runs on a different machine (a Linux server or cloud cluster, as opposed to local kind / minikube), the agent needs to register your Mac’s reachable IP so cluster pods can route to llama-server on your Mac.

# Find your Mac's IP on the LAN
ipconfig getifaddr en0
# example: 192.168.1.50

# Or on Tailscale / WireGuard
tailscale status | head -2

Edit ~/Library/LaunchAgents/com.llmkube.metal-agent.plist and add to the ProgramArguments array:

<string>--host-ip</string>
<string>192.168.1.50</string>

Reload:

launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
launchctl load ~/Library/LaunchAgents/com.llmkube.metal-agent.plist

Without --host-ip the agent registers localhost as the endpoint, which only works when Kubernetes lives on the same Mac (local minikube or Docker Desktop kind).
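
To confirm the edit actually landed, print the argument vector with PlistBuddy (it ships with macOS at /usr/libexec/PlistBuddy) before and after the reload:

/usr/libexec/PlistBuddy -c 'Print :ProgramArguments' \
  ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
# the printed array should now include --host-ip and 192.168.1.50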

Step 5: Deploy a model with Metal

From any machine that can talk to your cluster:

apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata: { name: phi-4-mini }
spec:
  source: https://huggingface.co/bartowski/phi-4-mini-instruct-GGUF/resolve/main/phi-4-mini-instruct-Q4_K_M.gguf
  format: gguf
  hardware:
    accelerator: metal
---
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata: { name: phi-4-mini }
spec:
  modelRef: phi-4-mini

Save both resources as phi-4-mini.yaml, then apply and watch for the service to come up:

kubectl apply -f phi-4-mini.yaml
kubectl get inferenceservice phi-4-mini -w
# wait for PHASE=Ready

The agent’s log should show:

"msg":"starting inference service","name":"phi-4-mini"
"msg":"registered endpoint","hostIP":"192.168.1.50","port":<allocated>
"msg":"started inference service","name":"phi-4-mini","pid":<llama-server-pid>

Query the model from anywhere in the cluster:

kubectl port-forward svc/phi-4-mini 8080:8080 &
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"phi-4-mini","messages":[{"role":"user","content":"hi"}]}'
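
If the curl works but you want to confirm the cluster-side wiring, check that the Service's Endpoints point at your Mac rather than at a pod IP (the same check the troubleshooting section below uses):

kubectl get endpoints phi-4-mini -o wide
# the listed address should be your Mac's --host-ip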

Memory budgets

The agent estimates each model’s memory cost (weights + KV cache + overhead) before spawning llama-server. If the model won’t fit in the configured budget, the agent refuses to start it and marks the InferenceService with status.schedulingStatus: InsufficientMemory.
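
As a rough illustration of what that estimate looks like (illustrative only, not the agent's exact estimator; the layer and embedding-dimension numbers below are hypothetical), weights track the GGUF file size while the KV cache grows linearly with context length:

awk 'BEGIN {
  weights = 2.5                      # GGUF file size in GiB, e.g. a ~4B model at Q4_K_M
  kv_per_token = 2 * 32 * 3072 * 2   # (K+V) * layers * embedding dim * fp16 bytes, hypothetical model
  kv = kv_per_token * 4096 / 2^30    # 4096-token context, converted to GiB
  printf "estimated footprint ~ %.1f GiB\n", weights + kv + 0.5   # plus ~0.5 GiB runtime overhead
}'
# prints ~4.5 GiB, comfortably inside the ~10.7 GB default budget of a 16 GB Mac (table below)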

Defaults are tuned by total system RAM:

Total RAM    Default fraction    Budget
16 GB        67%                 ~10.7 GB
36 GB        67%                 ~24.1 GB
48 GB        75%                 36 GB
64 GB        75%                 48 GB
128 GB       90%                 115 GB

Override the fraction with --memory-fraction 0.9 for a dedicated inference machine, or 0.5 if the Mac is also your daily-driver workstation. Add the flag to the launchd plist’s ProgramArguments the same way as --host-ip.

The agent also implements memory-pressure protection: if macOS reports critical memory pressure, the agent can evict the lowest-priority running InferenceService and refuse to spawn new ones until pressure normalizes. See the Memory-pressure protection guide for tuning.
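
To see what macOS is reporting right now, one quick check (a standard macOS sysctl, not something the agent exposes; the value mapping in the comment is the commonly documented one):

sysctl kern.memorystatus_vm_pressure_level
# typically 1 = normal, 2 = warning, 4 = critical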

ModelRouter integration

The Metal Agent’s InferenceServices are first-class targets for the ModelRouter CRD. Reference them by name like any other local backend:

apiVersion: inference.llmkube.dev/v1alpha1
kind: ModelRouter
metadata: { name: hybrid-router }
spec:
  backends:
    - name: local-mac
      inferenceServiceRef: { name: phi-4-mini }   # the InferenceService above
      tier: local
      capabilities: [chat]
    - name: cloud-opus
      external:
        provider: anthropic
        model: claude-opus-4-7
        credentialsSecretRef: { name: anthropic-key }
      tier: cloud
  rules:
    - name: pii-stays-on-mac
      match: { dataClassification: [pii] }
      route: { backends: [local-mac] }
      failClosed: true
  defaultRoute: local-mac

The router-proxy pod (which the controller schedules in the cluster, not on the Mac) dials the agent-registered endpoint when the rule resolves to local-mac. From the router’s perspective the Mac-served backend is indistinguishable from a container-served one — same InferenceServiceRef shape, same fail-closed semantics, same per-rule timeout budgets.
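
To try the router end to end, port-forward its Service and send the same kind of chat completion as before. The Service name and port below are assumptions (the controller is assumed to name the router-proxy Service after the ModelRouter and serve on 8080); check kubectl get svc in the namespace for the real values:

kubectl port-forward svc/hybrid-router 8080:8080 &
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"phi-4-mini","messages":[{"role":"user","content":"route me"}]}'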

See the ModelRouter concept doc for the full policy model.

Cross-cluster fleet shape

Heterogeneous clusters are the strongest pattern: NVIDIA nodes in a cloud for heavy workloads, Mac Studios on-prem for low-latency or sensitive work, all managed by the same controller with the same CRDs. The agent makes the Mac visible to the controller exactly as a Linux node is visible to a Deployment reconciler, just with accelerator: metal instead of accelerator: cuda on the Model.

Operationally:

  • Put the Mac on the same VPN / Tailscale tailnet as your cluster’s worker nodes.
  • Set --host-ip to the Mac’s address on that network.
  • The controller routes all accelerator: metal InferenceServices to whatever agent is registered for that endpoint.
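
In practice the second bullet means the Mac's tailnet (or VPN) address. To find it and confirm a worker node can reach it (commands assume Tailscale; substitute your VPN's equivalent):

# On the Mac: the address to pass as --host-ip
tailscale ip -4

# From any cluster worker node: confirm the Mac answers on that address
ping -c 3 <tailscale-ip-of-mac>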

Optional: Apple Silicon power metrics

For InferCost (LLMKube’s companion FinOps project) per-token cost attribution on Apple Silicon, the agent can publish CPU / GPU / ANE / Combined power gauges sourced from powermetrics. This is disabled by default because powermetrics requires root.

Enable in three steps:

  1. Install the bundled NOPASSWD sudoers fragment, which pins both the binary path and the argument vector so the grant is the narrowest possible:

    make install-powermetrics-sudo

  2. Add --apple-power-enabled to the launchd plist’s ProgramArguments array.

  3. Reload the agent.

The four gauges exposed:

  • llmkube_metal_agent_apple_power_combined_watts
  • llmkube_metal_agent_apple_power_gpu_watts
  • llmkube_metal_agent_apple_power_cpu_watts
  • llmkube_metal_agent_apple_power_ane_watts
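
Once enabled, you can spot-check the gauges locally. This assumes the agent serves Prometheus metrics at /metrics on the same local port as /healthz; if it does not in your build, check the agent log for the metrics listen address:

curl -s http://localhost:9090/metrics | grep apple_power
# expect the four *_watts gauges once powermetrics sampling has started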

See the deployment/macos/README.md for the full sudoers setup and a manual install path that lets you inspect each step before running it.

Troubleshooting

Agent process not running after install
Check /tmp/llmkube-metal-agent.log (the StandardOutPath/StandardErrorPath configured in the bundled launchd plist) for the first-launch error. Most common cause: llama-server not on PATH or at the configured --llama-server path.

Pods can’t reach llama-server (remote cluster)
The agent registered localhost. Confirm --host-ip is set in the plist and points at an address reachable from your cluster’s worker nodes:

# From a worker node:
ping <your-mac-ip>
curl http://<your-mac-ip>:<allocated-port>/v1/models

If those work but routing through the cluster Service fails, check the registered Endpoints object:

kubectl get endpoints <inferenceservice-name>
# expect: subsets[0].addresses[0].ip = your Mac's --host-ip

InferenceService stuck in InsufficientMemory
The agent’s pre-flight estimator says the model won’t fit. Either shrink the model (use a smaller quantization), reduce the context size in the InferenceService spec, or raise --memory-fraction. If the Mac is a dedicated inference machine, 0.9 is reasonable.
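
To confirm the estimator is what's holding the service back, read the status field directly:

kubectl get inferenceservice phi-4-mini -o jsonpath='{.status.schedulingStatus}'
# expected: InsufficientMemory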

macOS firewall prompt on first run
The Metal Agent listens on 127.0.0.1:9090 for its own health/metrics, and llama-server listens on an allocated port for inbound inference. macOS will prompt to allow incoming connections on first run. Allow them.

Agent log shows replicas=0; stopping process unexpectedly
A controller-side reconcile saw spec.replicas=0 on the InferenceService. Check whether something scaled it down (another operator, a Helm upgrade reverting your spec, an Argo CD application syncing a stale value).
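
A quick way to see the value the controller acted on:

kubectl get inferenceservice phi-4-mini -o jsonpath='{.spec.replicas}'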

Uninstall

cd /path/to/LLMKube-checkout
make uninstall-metal-agent

That tears down the launchd service, removes the binary from /usr/local/bin, and deletes the plist. Model weights downloaded into the agent’s --model-store path stay on disk (the agent doesn’t clean those up; remove manually if needed).
