How We Got Native Metal GPU Performance in Kubernetes (Without Containers)
Everyone trying to run LLMs on Apple Silicon with Kubernetes hits the same wall: containers on macOS run inside a Linux VM, and that VM cannot access Metal GPUs. You get CPU-only inference at maybe 5-15 tokens per second instead of 40-60 with Metal. We solved this by inverting the architecture: instead of bringing the GPU into the container, we brought Kubernetes orchestration out to the GPU. Here is how the LLMKube Metal Agent works and why this approach gets native performance.
The Problem Everyone Hits
Docker Desktop, Podman, and minikube all run Linux containers on macOS through a lightweight Linux VM. On Linux, GPU passthrough to containers is a solved problem: NVIDIA's container toolkit exposes CUDA devices directly. But on macOS, the host GPU speaks Metal, and the Linux VM inside the container has no way to talk to it.
This is not a configuration problem. It is a fundamental architectural constraint. The Linux kernel running inside the VM does not have Metal drivers. It cannot have Metal drivers. Metal is an Apple-proprietary API that only exists in macOS and iOS.
Red Hat has been doing interesting work on paravirtualized GPU access, translating Vulkan calls from the VM guest to the macOS host. Their research shows 75-95% of native performance, which is promising. But it is still experimental, requires specific VM configurations, and does not help inference stacks like llama.cpp that are optimized for the Metal backend rather than Vulkan.
The practical impact:
A Mac Studio M4 Max can run Llama 3.1 8B at 40-60 tokens per second natively with Metal acceleration. The same model inside a container on the same machine? Roughly 5-15 tokens per second on CPU-only, since the Linux VM has no Metal access. That is a 4-8x performance gap — the difference between a responsive tool and a frustrating one.
The Architectural Insight
The conventional approach is to try harder to get GPU access inside the container. Paravirtualization, API remoting, driver translation. These are all attempts to bridge the VM boundary.
We asked a different question: what if the inference process does not need to be inside the container at all?
Kubernetes is fundamentally a scheduling and orchestration system. It decides what should run, where, and when. The actual compute can happen anywhere, as long as the result is reachable as a Service endpoint. This is the same principle behind external services and headless services in Kubernetes. The control plane manages the lifecycle; the workload runs wherever it needs to.
Option A: Remote Cluster (Recommended, v0.4.16+)

```
Linux Server / Cloud                  Mac (Apple Silicon)
+-- Kubernetes                        +-- Metal Agent (--host-ip)
|   +-- LLMKube Operator      ◄────►  |   +-- Watches K8s API
|   +-- InferenceService CRD          |   +-- Spawns llama-server natively
|   +-- Service → Mac IP              +-- llama-server (full Metal GPU access)
                                          +-- All unified memory for inference
```
Option B: Co-located (minikube on same Mac)

```
macOS Host
+-- Minikube (K8s in VM)
|   +-- InferenceService CRD triggers Metal Agent
|
+-- Metal Agent (native macOS process, launchd)
|   +-- Watches K8s API, spawns llama-server natively
|
+-- llama-server (full Metal GPU access)
    +-- Registered back as K8s Service endpoint
```

The Metal Agent runs as a native macOS process, managed by launchd. It watches the Kubernetes API for InferenceService CRDs, then spawns llama-server directly on the macOS host with full Metal GPU access. The inference server is registered back into Kubernetes as a Service endpoint, so from the cluster's perspective it looks like any other service. Pods can reach it. Ingress can route to it. Health checks work. The abstraction is clean.
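The "registered back as a Service endpoint" step can be pictured as a selector-less Service paired with a manually managed Endpoints object, which is the standard Kubernetes pattern for routing cluster traffic to an external process. A minimal sketch (the resource names, port, and IP here are illustrative assumptions, not the agent's actual output):

```yaml
# A Service with no pod selector...
apiVersion: v1
kind: Service
metadata:
  name: qwen3-30b            # hypothetical name
  namespace: llmkube-system
spec:
  ports:
    - port: 8080
      targetPort: 8080
---
# ...paired with an Endpoints object that points at the host
# where llama-server is actually listening.
apiVersion: v1
kind: Endpoints
metadata:
  name: qwen3-30b            # must match the Service name
  namespace: llmkube-system
subsets:
  - addresses:
      - ip: 192.168.1.50     # the Mac's reachable IP (illustrative)
    ports:
      - port: 8080
```

Because the Endpoints name matches the Service name, kube-proxy routes any traffic sent to the service's cluster DNS name straight to the native macOS process.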
Same CRDs, Any GPU
The most important design goal was that users should not need to think about the underlying GPU architecture. The same InferenceService YAML that works on a Linux node with CUDA should work on a macOS node with Metal. The Metal Agent handles the translation transparently.
Here is a CUDA deployment on Linux, the existing pattern:
```yaml
apiVersion: llmkube.io/v1alpha1
kind: InferenceService
metadata:
  name: qwen3-32b
  namespace: llmkube-system
spec:
  model:
    source: huggingface
    name: Qwen/Qwen3-32B-GGUF
  accelerator: cuda
  resources:
    gpuCount: 2
```

And here is a Metal deployment on macOS:
```yaml
apiVersion: llmkube.io/v1alpha1
kind: InferenceService
metadata:
  name: qwen3-30b
  namespace: llmkube-system
spec:
  model:
    source: huggingface
    name: Qwen/Qwen3-30B-A3B-GGUF
  accelerator: metal
  resources:
    gpuLayers: 99
```

Same CRD. Same API. The `accelerator` field changes from `cuda` to `metal`, and the resource specification shifts from GPU count to GPU layers (because Metal uses unified memory, not discrete VRAM). Everything else is identical. The operator detects the accelerator type and routes to either the standard container-based deployment path or the Metal Agent.
How the Metal Agent Works
The Metal Agent has four components, each with a focused responsibility:
- Watcher: Monitors the Kubernetes API for InferenceService resources with `accelerator: metal`. When one is created, updated, or deleted, it triggers the appropriate lifecycle action. The watcher uses the standard Kubernetes client library and maintains a persistent watch connection.
- Executor: Spawns and manages llama-server processes natively on the macOS host. It configures Metal acceleration, sets GPU layer offloading, binds to a local port, and handles process lifecycle including health checks, restarts, and graceful shutdown.
- Registry: Creates Kubernetes Service and Endpoints resources that point back to the host machine where llama-server is running. This is what makes the inference server discoverable from inside the cluster. Other pods hit the service name, and Kubernetes routes traffic to the native process.
- Agent: The orchestration layer that ties the other three together. It manages the overall lifecycle: when a new InferenceService appears, the agent coordinates model download, server startup, health check verification, and service registration in the correct order.
Why launchd instead of running inside Kubernetes?
The Metal Agent needs to run as a native macOS process to access Metal GPUs. We use launchd because it is the standard macOS service manager. It handles automatic restart on failure, log management, and startup at boot. The agent is a single Go binary with no dependencies beyond llama.cpp and a valid kubeconfig.
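For readers unfamiliar with launchd, a unit like this is just a property list under `/Library/LaunchDaemons` or `~/Library/LaunchAgents`. A hypothetical sketch of what such a plist could look like (the label, binary path, and log locations are assumptions for illustration, not the shipped configuration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Hypothetical label; check the actual installed plist for the real one -->
  <key>Label</key>
  <string>io.llmkube.metal-agent</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/llmkube-metal-agent</string>
  </array>
  <!-- Start at boot and restart automatically on failure -->
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <!-- launchd handles log capture for the native process -->
  <key>StandardOutPath</key>
  <string>/usr/local/var/log/llmkube-metal-agent.log</string>
  <key>StandardErrorPath</key>
  <string>/usr/local/var/log/llmkube-metal-agent.err.log</string>
  <key>EnvironmentVariables</key>
  <dict>
    <key>KUBECONFIG</key>
    <string>/Users/me/.kube/config</string>
  </dict>
</dict>
</plist>
```

`RunAtLoad` and `KeepAlive` are what give the agent the restart-on-failure and start-at-boot behavior described above, with no extra supervisor process.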
Performance
Because the Metal Agent runs inference natively on the host, there is no virtualization overhead. You get the full performance of the Metal backend. Here is what we expect on an M4 Max with 128GB unified memory, compared to CUDA on an RTX 5060 Ti:
| Model | M4 Max (Metal) | RTX 5060 Ti (CUDA) |
|---|---|---|
| Llama 3.2 3B | 80-120 tok/s | 53 tok/s |
| Llama 3.1 8B | 40-60 tok/s | 52 tok/s |
| Qwen3 30B-A3B (MoE) | 45-65 tok/s | N/A (needs >16GB VRAM) |
The M4 Max numbers are expected ranges based on the Metal backend in llama.cpp. The interesting row is the last one: Qwen3-30B-A3B is a Mixture of Experts model that requires more than 16GB of VRAM. It cannot run on a single RTX 5060 Ti at all, but the M4 Max with 128GB unified memory handles it comfortably. This is where Apple Silicon has a genuine advantage for inference: massive unified memory that lets you run models that would require multi-GPU setups on discrete hardware.
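The ">16GB VRAM" claim is easy to sanity-check with back-of-envelope arithmetic. A quick sketch, assuming a typical ~4.5 bits per weight for a Q4_K_M-style quantization (an assumption, not a measured figure):

```shell
# Weights-only memory for a 30B-parameter model at ~4.5 bits/weight:
# 30e9 * 4.5 / 8 bytes ≈ 16-17 GB before KV cache and activations,
# which already saturates a 16GB card.
params=30000000000
bits_x10=45    # 4.5 bits/weight, scaled by 10 to stay in integer math
weight_gb=$(( params * bits_x10 / 10 / 8 / 1000000000 ))
echo "weights alone: ~${weight_gb} GB"
```

Add a KV cache and runtime overhead on top of that, and the model has no chance of fitting on a single 16GB GPU, while it is a comfortable fit in 128GB of unified memory.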
A note on these numbers:
Performance is comparable to running ollama or llama-server directly. That is exactly the point. The Metal Agent adds Kubernetes orchestration without adding performance overhead. The inference path is identical to running the server natively, because it is running the server natively.
Why Not Just Use Ollama?
Fair question. Ollama is excellent for what it does: single-machine, single-user LLM serving with a great developer experience. If that is your use case, Ollama is probably the right tool.
The Metal Agent solves a different problem. It is for teams and workflows that need Kubernetes-native orchestration across heterogeneous hardware:
| Capability | Ollama | Metal Agent |
|---|---|---|
| Quick local setup | Excellent | Requires K8s |
| Kubernetes CRDs | No | Yes |
| Model lifecycle management | Basic | Full (download, health, restart) |
| Same workflow on CUDA + Metal | No | Yes |
| Multi-node orchestration | No | Yes (via K8s) |
This is not a criticism of Ollama. These are different tools for different problems. If you just want to chat with a model on your Mac, use Ollama. If you need to manage model deployments across a mix of CUDA and Metal hardware with the same CRDs, that is what the Metal Agent is for.
Real-World Use Case
We are running both GPU architectures in our own infrastructure right now, managed by LLMKube with the same CRDs:
- ShadowStack (Linux, 2x RTX 5060 Ti): Running Qwen3-32B via CUDA with layer-based sharding across both GPUs. This handles heavier reasoning workloads that benefit from the larger parameter count.
- Mac Studio M4 Max (macOS): Running Qwen3-30B-A3B via the Metal Agent. The MoE architecture means only 3B parameters are active per token, giving fast inference on the unified memory architecture.
Both nodes back a set of interconnected services: internal APIs that use LLM-driven analysis, a Discord bot that handles team queries, and automation workflows that call the inference endpoint directly. The applications do not know or care which backend is serving their requests. They hit a Kubernetes Service, and the cluster routes to whichever node has capacity.
The operational value is that we manage everything through the same kubectl and llmkube workflows. Deploy a model, check its status, view logs, scale it down. The commands are identical whether the underlying hardware is NVIDIA or Apple Silicon.
No Kubernetes on the Mac
Starting with v0.4.16, the Metal Agent supports a --host-ip flag that changes the architecture entirely: Kubernetes runs on a Linux server (or cloud), and the Mac is a pure inference node. No minikube, no Podman VM, no K8s overhead. Every byte of unified memory goes to the model.
This matters more than it sounds. A Mac Studio with 128GB unified memory running minikube reserves several gigabytes for the VM, the kubelet, etcd, and the control plane components. With a remote cluster, that memory is reclaimed for inference. On a 36GB Mac Mini, the difference between 30GB available for a model and 34GB available can determine whether a 30B-parameter model fits entirely in GPU memory or spills to CPU.
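To make the headroom point concrete, here is a toy fit check. The model and KV-cache sizes are illustrative assumptions, not measurements:

```shell
# Does (model weights + KV cache) fit in the memory left for inference?
model_gb=27      # e.g. a ~30B model at a mid-range quantization (assumed)
kv_cache_gb=4    # rough KV-cache budget for a moderate context (assumed)
need=$(( model_gb + kv_cache_gb ))
for avail_gb in 30 34; do
  if [ "$need" -le "$avail_gb" ]; then verdict=fits; else verdict=spills; fi
  echo "${avail_gb} GB free: need ${need} GB -> ${verdict}"
done
```

With these numbers, 30GB of headroom spills to CPU while 34GB fits entirely on the GPU side of unified memory, which is exactly the swing that removing a local Kubernetes stack can buy.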
Remote Cluster Architecture (v0.4.16+)
```
┌──────────────────────────┐        ┌──────────────────────────┐
│  Linux Server / Cloud    │        │   Mac (Apple Silicon)    │
│                          │        │                          │
│  Kubernetes              │  LAN/  │  Metal Agent             │
│  LLMKube Operator        │  VPN/  │  --host-ip <mac-ip>      │
│  InferenceService CRD    │◄──────►│  Watches K8s API         │
│  Service → Mac IP        │  TLS   │  Spawns llama-server     │
│                          │        │                          │
│  CUDA nodes (optional)   │        │  llama-server (Metal)    │
│  llama.cpp containers    │        │  Full GPU access ✅      │
└──────────────────────────┘        └──────────────────────────┘
```

The --host-ip flag tells the Metal Agent to register the Mac's reachable IP address (instead of localhost) when creating Kubernetes Service endpoints. This means pods in the remote cluster can route inference traffic to the Mac over any network: LAN, Tailscale, WireGuard, or any routable path. Same CRDs, same llmkube deploy, regardless of where Kubernetes runs.
Try It Yourself
If you have a Mac with Apple Silicon, you can set this up in about 10 minutes. The recommended path uses an existing Kubernetes cluster (fewer prerequisites, more memory for inference). If you do not have a cluster yet, the minikube path works too.
Option A: Existing Kubernetes cluster (Recommended)

```shell
# Install llama.cpp
brew install llama.cpp

# Download Metal Agent from GitHub Releases
# https://github.com/defilantech/LLMKube/releases

# Point at your cluster (copy kubeconfig from your server)
export KUBECONFIG=~/.kube/config

# Start the Metal Agent with your Mac's IP
llmkube-metal-agent --host-ip $(ipconfig getifaddr en0)

# Deploy a model with Metal acceleration
llmkube deploy llama-3.1-8b --accelerator metal
```

Option B: Local minikube (single-machine)
```shell
# Install prerequisites
brew install llama.cpp minikube

# Start minikube
minikube start

# Install LLMKube operator
helm repo add llmkube https://defilantech.github.io/LLMKube
helm install llmkube llmkube/llmkube --namespace llmkube-system --create-namespace

# Install Metal Agent
make install-metal-agent

# Deploy a model with Metal acceleration
llmkube deploy llama-3.1-8b --accelerator metal
```

The Metal Agent installs as a launchd service. Once running, it automatically picks up any InferenceService with `accelerator: metal` and handles the rest: model download, server startup, health checks, and service registration.
For the full walkthrough, check the Metal quickstart guide. If you run into issues, open an issue on GitHub.
Further reading: The Metal quickstart guide covers installation in detail. The LLMKube repository has the full source for the operator and Metal Agent. And the getting started guide walks through the complete setup from scratch.
Ready to Run LLMs on Apple Silicon?
The Metal Agent brings Kubernetes-native orchestration to macOS with full Metal GPU performance. Same CRDs as CUDA. No containers required for inference.