Frequently Asked Questions
Common questions about LLMKube, how it compares to alternatives, and what it can do for you.
Why not NVIDIA Dynamo?
NVIDIA Dynamo is built for enterprise GPU fleets (GB200, H100) and maximum throughput at massive scale. It's optimized for data centers with dedicated infrastructure teams.
LLMKube is for teams who need local inference running quickly on the hardware they have. We focus on air-gapped deployments, regulated industries, and edge computing where simplicity matters more than squeezing every last token per second from expensive hardware.
If you have a fleet of H100s and a dedicated MLOps team, Dynamo might be right for you. If you need to deploy LLMs on your existing Kubernetes cluster without a PhD in distributed systems, that's what LLMKube is for.
Why not Ollama?
Ollama is excellent for single-node local development. We use it ourselves for quick testing.
LLMKube solves a different problem: multi-node orchestration in Kubernetes with production-grade observability. If you need to deploy LLMs across a cluster, integrate with existing K8s tooling (GitOps, Prometheus, etc.), or run in air-gapped environments with proper compliance controls, that's where LLMKube shines.
Think of it this way: Ollama is like running SQLite on your laptop. LLMKube is like running Postgres in production.
Why not vLLM?
vLLM is a fantastic inference engine that uses techniques like PagedAttention to maximize throughput. If raw serving throughput is your primary concern, it's an excellent choice.
LLMKube uses llama.cpp as its backend, which prioritizes simplicity, broad hardware support, and easy deployment. We're optimized for edge and air-gapped deployments where you need something running in an afternoon, not a week.
vLLM also requires more GPU memory and has a steeper operational learning curve. LLMKube gives you a simpler path to production with Kubernetes-native patterns you already know.
Why not KServe/Knative?
KServe is a powerful, general-purpose model serving platform that supports many ML frameworks. It's the right choice if you're serving diverse model types (sklearn, TensorFlow, PyTorch, etc.) and need a unified platform.
LLMKube is purpose-built for LLM inference. We focus on one thing and do it well: deploying local LLMs with GPU acceleration in Kubernetes. This means simpler CRDs, faster setup, and less operational overhead.
If you're already running KServe and just need to add LLM support, you might integrate vLLM as a runtime. If you're starting fresh and only need LLM inference, LLMKube gets you there faster.
Why not Ray Serve?
Ray is a distributed computing framework that happens to support model serving. It's powerful but adds significant complexity: you're running a separate scheduler alongside Kubernetes.
LLMKube is pure Kubernetes-native. No additional scheduler, no new concepts to learn. If you know kubectl and Helm, you know how to operate LLMKube. Your existing monitoring, GitOps pipelines, and RBAC policies just work.
Ray makes sense if you're doing complex ML pipelines with training, preprocessing, and serving all in one system. For inference-only workloads, LLMKube is simpler.
Does it work with Argo CD / Flux?
Yes. LLMKube uses standard Kubernetes Custom Resource Definitions (CRDs). Your Model and InferenceService manifests are just YAML files that Argo CD and Flux handle like any other Kubernetes resource.
Store your model definitions in Git, let your GitOps tool sync them, and LLMKube's operator handles the rest. No special integration required.
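As a sketch, here's what an Argo CD Application syncing a Git directory of LLMKube manifests might look like. The Application fields are standard Argo CD; the repo URL, path, and namespaces are placeholders for your own layout:

```yaml
# Hypothetical example: sync a Git directory containing LLMKube Model
# and InferenceService manifests with Argo CD. The repoURL, path, and
# namespaces are placeholders -- substitute your own.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llm-models
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/infra.git   # placeholder repo
    targetRevision: main
    path: clusters/prod/llm    # directory of Model/InferenceService YAML
  destination:
    server: https://kubernetes.default.svc
    namespace: llm
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Because the CRDs are ordinary Kubernetes resources, Argo CD's sync, diff, and drift detection work on them with no extra configuration.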
What about AMD GPUs?
Not yet, but it's on the roadmap.
Currently we support NVIDIA GPUs (CUDA) and Apple Silicon (Metal). AMD ROCm support is planned for a future release. The llama.cpp backend we use already supports ROCm, so it's primarily an integration and testing effort.
If AMD support is critical for your use case, open an issue on GitHub. Community interest helps us prioritize.
Does it support auto-scaling / HPA?
Not yet. Currently, scaling is manual (you set the replica count in your InferenceService spec).
Queue-depth based auto-scaling is planned for v0.5.0. The goal is to scale replicas based on request queue depth and GPU utilization, similar to KEDA but purpose-built for LLM workloads.
What models are supported?
LLMKube supports any GGUF-format model. This includes most popular open models: Llama, Mistral, Qwen, DeepSeek, Mixtral, Phi, and many more.
We maintain a catalog of 10 pre-configured models that deploy with a single command (e.g., llmkube deploy llama-3.1-8b --gpu). For models not in the catalog, just provide the HuggingFace URL or any HTTP source in your Model spec.
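As an illustrative sketch of a custom-source deployment, a Model pointing at a GGUF file on HuggingFace might look like the following. The field names and API group here are assumptions, not the authoritative schema; check the LLMKube CRD documentation (e.g. `kubectl explain model`) for the real spec:

```yaml
# Illustrative sketch only -- the actual Model CRD fields may differ;
# consult the LLMKube CRD schema. The idea: point the spec at any
# HTTP(S) URL that serves a GGUF file.
apiVersion: llmkube.dev/v1alpha1   # assumed API group/version
kind: Model
metadata:
  name: mistral-7b-instruct
spec:
  # Example GGUF URL; any reachable HTTP(S) source works the same way.
  source: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```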
How does air-gapped deployment work?
For air-gapped environments, you download models once on a connected system, then transfer them to your air-gapped cluster via your approved data transfer process.
LLMKube's persistent model cache stores models on a PVC. Once the model files are on the PVC (however you get them there), deployments work identically to connected environments. No external network calls required during inference.
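One way to get model files onto the cache PVC in an air-gapped cluster is a short-lived helper pod that mounts the PVC, into which you copy the GGUF file with `kubectl cp`. A minimal sketch, assuming the PVC name and mount path shown (match them to your LLMKube install):

```yaml
# Hypothetical helper pod for loading models onto the cache PVC.
# The claimName and mountPath are assumptions -- use the values from
# your LLMKube deployment.
apiVersion: v1
kind: Pod
metadata:
  name: model-loader
  namespace: llm
spec:
  containers:
    - name: loader
      image: busybox:1.36
      command: ["sleep", "3600"]   # keep the pod alive while copying
      volumeMounts:
        - name: model-cache
          mountPath: /models
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: llmkube-model-cache   # assumed PVC name
```

With the pod running, copy the file (`kubectl -n llm cp ./llama-3.1-8b.Q4_K_M.gguf model-loader:/models/`), then delete the pod. Subsequent deployments read the model straight from the PVC.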
We're also working on documentation for pre-loading container images and model files for fully disconnected installations.
Still have questions?
Open an issue on GitHub or reach out directly. We're happy to help.