Frequently Asked Questions
Common questions about LLMKube, how it compares to alternatives, and what it can do for you.
Why not NVIDIA Dynamo?
NVIDIA Dynamo is built for enterprise GPU fleets (GB200, H100) and maximum throughput at massive scale. It's optimized for data centers with dedicated infrastructure teams.
LLMKube is for teams who need local inference running quickly on the hardware they have. We focus on simplicity: deploy LLMs on your existing Kubernetes cluster with a CLI and a Helm chart.
If you have a fleet of H100s and a dedicated MLOps team, Dynamo might be right for you. If you need to deploy LLMs on your existing Kubernetes cluster without a PhD in distributed systems, that's what LLMKube is for.
Why not Ollama?
Ollama is excellent for single-node local development. We use it ourselves for quick testing.
LLMKube solves a different problem: multi-node orchestration in Kubernetes with production-grade observability. If you need to deploy LLMs across a cluster, integrate with existing K8s tooling (GitOps, Prometheus, etc.), or run in on-premise environments with Kubernetes-native orchestration, that's where LLMKube shines.
Think of it this way: Ollama is like running SQLite on your laptop. LLMKube is like running Postgres in production.
Why not vLLM?
As of v0.6.0, vLLM is a first-class pluggable runtime in LLMKube. You can deploy models with vLLM by setting runtime: vllm in your InferenceService spec.
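As a rough sketch, an InferenceService selecting the vLLM runtime might look like the manifest below. Only the runtime: vllm field is confirmed above; the API group, model reference, and resource fields are illustrative assumptions about the CRD shape.

```yaml
# Hypothetical sketch: only `runtime: vllm` is documented above;
# apiVersion, modelRef, and resources are assumed field names.
apiVersion: llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: llama-vllm
spec:
  runtime: vllm          # select the vLLM pluggable runtime (v0.6.0+)
  modelRef: llama-3.1-8b # assumed: reference to a deployed Model resource
  resources:
    limits:
      nvidia.com/gpu: 1  # vLLM targets CUDA-capable GPUs
```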
The difference is scope: vLLM is a single-node inference engine focused on maximum throughput via PagedAttention and tensor parallelism. LLMKube is a Kubernetes operator that orchestrates inference at the cluster level, handling model lifecycle, autoscaling, GPU scheduling, observability, and multi-runtime support.
Think of it this way: vLLM is the engine, LLMKube is the fleet management platform. You can run vLLM directly if you have one node and one model, but LLMKube gives you production-grade orchestration when you need to manage multiple models, teams, and GPUs across a cluster.
Why not KServe/Knative?
KServe is a powerful, general-purpose model serving platform that supports many ML frameworks. It's the right choice if you're serving diverse model types (sklearn, TensorFlow, PyTorch, etc.) and need a unified platform.
LLMKube is purpose-built for LLM inference. We focus on one thing and do it well: deploying local LLMs with GPU acceleration in Kubernetes. This means simpler CRDs, faster setup, and less operational overhead.
If you're already running KServe and just need to add LLM support, you might integrate vLLM as a runtime. If you're starting fresh and only need LLM inference, LLMKube gets you there faster.
Why not Ray Serve?
Ray is a distributed computing framework that happens to support model serving. It's powerful but adds significant complexity: you're running a separate scheduler alongside Kubernetes.
LLMKube is pure Kubernetes-native. No additional scheduler, no new concepts to learn. If you know kubectl and Helm, you know how to operate LLMKube. Your existing monitoring, GitOps pipelines, and RBAC policies just work.
Ray makes sense if you're doing complex ML pipelines with training, preprocessing, and serving all in one system. For inference-only workloads, LLMKube is simpler.
Does it work with Argo CD / Flux?
Yes. LLMKube uses standard Kubernetes Custom Resource Definitions (CRDs). Your Model and InferenceService manifests are just YAML files that Argo CD and Flux handle like any other Kubernetes resource.
Store your model definitions in Git, let your GitOps tool sync them, and LLMKube's operator handles the rest. No special integration required.
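For a concrete sketch, an Argo CD Application that syncs a Git directory of LLMKube manifests needs nothing LLMKube-specific; the repository URL, path, and namespace below are placeholders.

```yaml
# Standard Argo CD Application; LLMKube CRs sync like any other YAML.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llm-models
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/llm-configs  # placeholder repo
    targetRevision: main
    path: models/   # directory holding Model / InferenceService YAML
  destination:
    server: https://kubernetes.default.svc
    namespace: llmkube
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert out-of-band changes
```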
What about AMD GPUs?
Not yet, but it's on the roadmap.
Currently we support NVIDIA GPUs (CUDA) and Apple Silicon (Metal). AMD ROCm support is planned for a future release. The llama.cpp backend we use already supports ROCm, so it's primarily an integration and testing effort.
If AMD support is critical for your use case, open an issue on GitHub. Community interest helps us prioritize.
Does it support auto-scaling / HPA?
Yes! Autoscaling shipped in v0.6.0: InferenceService supports the native Kubernetes Horizontal Pod Autoscaler (HPA) with per-runtime metrics.
Set minReplicas, maxReplicas, and a target metric value in your InferenceService spec, and LLMKube handles the rest. Each pluggable runtime (llama.cpp, vLLM, TGI) provides its own default HPA metric through the HPAMetricProvider interface.
For example, the llama.cpp runtime scales on llamacpp:requests_processing, while vLLM uses its own queue-depth metrics. You can also specify custom metrics if your setup uses Prometheus Adapter.
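The fields named above (minReplicas, maxReplicas, a target metric value) might combine as in this sketch. The llamacpp:requests_processing metric is the documented llama.cpp default; the exact nesting and field names around autoscaling are assumptions.

```yaml
# Hypothetical autoscaling stanza: minReplicas/maxReplicas and the
# default llama.cpp metric come from the docs; the layout is assumed.
apiVersion: llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: llama-autoscaled
spec:
  runtime: llamacpp
  autoscaling:
    minReplicas: 1
    maxReplicas: 4
    metric: llamacpp:requests_processing  # per-runtime default metric
    targetValue: "5"  # assumed: add replicas above 5 in-flight requests
```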
What models are supported?
LLMKube supports multiple model formats across its pluggable runtimes: GGUF models via llama.cpp (most popular open models), SafeTensors and PyTorch models via vLLM and TGI, MLX models via the Metal Agent's oMLX backend, and any model format via the generic runtime with custom containers.
We maintain a catalog of 20+ pre-configured models that deploy with a single command (e.g., llmkube deploy llama-3.1-8b --gpu). This includes Llama, Mistral, Qwen, DeepSeek, Mixtral, Phi, and more. When using the Ollama backend on Metal Agent, models are downloaded automatically from the Ollama registry.
How does air-gapped deployment work?
For air-gapped environments, you download models once on a connected system, then transfer them to your air-gapped cluster via your approved data transfer process.
LLMKube's persistent model cache stores models on a PVC. Once the model files are on the PVC (however you get them there), deployments work identically to connected environments. No external network calls required during inference.
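One way to stage model files onto the cache PVC, sketched with stock kubectl; the model URL, namespace, pod, and mount path are placeholders, and the actual transfer into the air-gapped network should follow whatever process your environment approves.

```shell
# On the connected side: fetch the model file once.
curl -LO https://example.com/models/llama-3.1-8b.gguf  # placeholder URL

# After transferring the file into the air-gapped network, copy it
# into any pod that mounts the model-cache PVC (names are placeholders).
kubectl cp llama-3.1-8b.gguf llmkube/model-cache-pod:/models/llama-3.1-8b.gguf

# Verify the file landed on the PVC.
kubectl exec -n llmkube model-cache-pod -- ls -lh /models/
```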
We're also working on documentation for pre-loading container images and model files for fully disconnected installations.
Still have questions?
Open an issue on GitHub or reach out directly. We're happy to help.