Frequently asked questions

Operational answers about LLMKube. Looking for how it stacks up against other projects? See how LLMKube compares.

Does it work with Argo CD / Flux?

Yes. LLMKube uses standard Kubernetes Custom Resource Definitions (CRDs). Your Model and InferenceService manifests are just YAML files that Argo CD and Flux handle like any other Kubernetes resource.

Store your model definitions in Git, let your GitOps tool sync them, and LLMKube's operator handles the rest. No special integration required.

What about AMD GPUs?

Not yet, but it's on the roadmap.

Currently we support NVIDIA GPUs (CUDA) and Apple Silicon (Metal). AMD ROCm support is planned for a future release. The llama.cpp backend we use already supports ROCm, so it's primarily an integration and testing effort.

If AMD support is critical for your use case, open an issue on GitHub. Community interest helps us prioritize.

Does it support auto-scaling / HPA?

Yes. HPA support shipped in v0.6.0. InferenceService integrates with the native Kubernetes Horizontal Pod Autoscaler (HPA) and scales on per-runtime metrics.

Set minReplicas, maxReplicas, and a target metric value in your InferenceService spec, and LLMKube handles the rest. Each pluggable runtime (llama.cpp, vLLM, TGI) provides its own default HPA metric through the HPAMetricProvider interface.

For example, the llama.cpp runtime scales on llamacpp:requests_processing, while vLLM uses its own queue-depth metrics. You can also specify custom metrics if your setup uses Prometheus Adapter.
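For intuition about how a target metric value turns into a replica count, here is a minimal Python sketch of the scaling rule described in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), clamped to the min and max bounds. The function name, parameter names, and sample numbers are illustrative assumptions and not part of LLMKube. The real autoscaler also applies stabilization windows and readiness handling that this sketch leaves out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the documented HPA rule (illustrative helper, not LLMKube code).

    current_metric and target_metric are per-pod averages, for example the
    average number of in-flight requests per pod (an assumed interpretation).
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # The HPA skips scaling when the ratio is close enough to 1.0
    # (the default tolerance is 10%).
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured minReplicas / maxReplicas bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 2 pods averaging 12 in-flight requests each, target of 4 per pod.
    print(desired_replicas(2, 12, 4, min_replicas=1, max_replicas=8))
```

With these sample numbers the ratio is 3, so the sketch asks for 6 replicas. Lowering maxReplicas to 5 would cap the result at 5.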

What models are supported?

LLMKube supports multiple model formats across its pluggable runtimes:

• GGUF via llama.cpp (most popular open models)
• SafeTensors and PyTorch via vLLM and TGI
• MLX via the metal-agent's oMLX backend
• Any format via the generic runtime with custom containers

We also ship a catalog of 20+ pre-configured models, including Llama, Mistral, Qwen, DeepSeek, Mixtral, and Phi. Each one deploys with a single command (e.g., llmkube deploy llama-3.1-8b --gpu). With the Ollama backend on the metal-agent, models download automatically from the Ollama registry.

How does air-gapped deployment work?

For air-gapped environments, you download models once on a connected system, then transfer them to your air-gapped cluster via your approved data transfer process.

LLMKube's persistent model cache stores models on a PVC. Once the model files are on the PVC (however you get them there), deployments work identically to connected environments. No external network calls required during inference.
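Whatever transfer process you use, it can help to confirm that the model files arrived intact before pointing a deployment at them. The following is a minimal sketch, not an LLMKube feature: it records SHA-256 checksums on the connected system and re-checks them on the air-gapped side. The function names and the directory layout are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_dir):
    """Map each file's path (relative to model_dir) to its SHA-256 digest."""
    root = Path(model_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(model_dir, manifest):
    """Return the relative paths that are missing or whose digest differs."""
    root = Path(model_dir)
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

On the connected system you would call build_manifest on the download directory and carry the result along with the files (for example as JSON). After copying the files onto the PVC, verify_manifest returning an empty list indicates that every listed file is present and unchanged.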

We're also working on documentation for pre-loading container images and model files for fully disconnected installations.

LLMKube

Kubernetes for Local LLMs. Deploy, manage, and scale AI inference workloads with production-grade orchestration.

© 2026 Defilan Technologies LLC

Built for the Kubernetes and AI communities

LLMKube is not affiliated with or endorsed by the Cloud Native Computing Foundation or the Kubernetes project. Kubernetes® is a registered trademark of The Linux Foundation.