Production-Ready LLM Infrastructure
vLLM, TGI, llama.cpp, or bring your own — one operator for every runtime.
How it works
The init container pattern: separate model management from serving
Two lanes: control plane above, data plane below. The Model Controller downloads weights once per namespace into a shared cache. The InferenceService Controller creates pods whose init container mounts that cache before the runtime starts. Cold-starting a new replica is a pod boot — no re-download.
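The generated Pod is plain Kubernetes, which is why the cold-start claim holds. A minimal sketch of the pattern (image names and the PVC name are illustrative, not the operator's actual output):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llama-serve
spec:
  initContainers:
    - name: model-fetch              # runs to completion before the runtime starts
      image: model-downloader:latest # illustrative image
      volumeMounts:
        - name: model-cache
          mountPath: /models
  containers:
    - name: runtime
      image: llama-cpp-server:latest # illustrative image
      volumeMounts:
        - name: model-cache          # same cache, read-only at serve time
          mountPath: /models
          readOnly: true
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-cache-pvc   # shared per-namespace cache
```

Because the weights already sit in the PVC, the init container exits immediately on subsequent boots and the replica starts serving as soon as the runtime loads the model.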

Init Container Pattern
Models cached in persistent storage
Download logic separate from inference
Standard patterns engineers already know
What's working today
Production-validated features ready for your deployments
Hybrid GPU/CPU Offloading
New in v0.7.0: Run 30B+ MoE models on consumer GPUs. Expert weights in system RAM, active path on GPU, scheduler-aware host memory requests.
- moeCPUOffload, moeCPULayers, noKvOffload fields
- hostMemory request for scheduler placement
- Tensor overrides and batch-size controls
- Seven new runtime controls for llama.cpp and vLLM
- HuggingFace repo ID source for runtime-resolved models
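Pulled together, the new fields might appear in a spec like this. Illustrative only: the API group/version, resource nesting, and values are assumptions; the field names come from the v0.7.0 notes above.

```yaml
apiVersion: serving.llmkube.io/v1alpha1  # assumed API group
kind: InferenceService
metadata:
  name: qwen-moe
spec:
  runtime: llama-cpp
  model: qwen3-30b-moe      # hypothetical catalog name
  moeCPUOffload: true       # expert weights stay in system RAM
  moeCPULayers: 28          # illustrative layer count
  noKvOffload: true         # keep the KV cache on GPU
  resources:
    hostMemory: 64Gi        # scheduler-aware host memory request
```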
Pluggable Runtime Backends
v0.6.0: Choose the best inference engine for your workload. One CRD, five runtimes.
- vLLM with PagedAttention and tensor parallelism
- TGI (HuggingFace Text Generation Inference)
- llama.cpp with GGUF and low-memory efficiency
- PersonaPlex (Moshi) for speech-to-speech
- Generic runtime for custom containers
HPA Autoscaling
v0.6.0: Scale inference replicas automatically based on real metrics. Per-runtime metric defaults.
- Kubernetes HPA with custom inference metrics
- Configurable min/max replicas and target values
- Per-runtime default metrics via HPAMetricProvider
- Works with Prometheus Adapter
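The scaling object itself is a standard `autoscaling/v2` HPA. A sketch of what the operator would manage for a service, assuming Prometheus Adapter exposes the metric (the metric name and target here are illustrative, not the operator's defaults):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-serve-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-serve          # the InferenceService's workload
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_requests_per_second  # illustrative metric
        target:
          type: AverageValue
          averageValue: "10"   # scale up past ~10 req/s per replica
```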
Inference Metrics Dashboard
v0.6.0: Pre-built Grafana dashboard for inference performance. Track latency, throughput, and errors.
- Request latency, throughput, and error rates
- Per-model and per-runtime breakdowns
- Import-ready JSON in config/grafana/
GPU Acceleration
Updated in v0.6.0: NVIDIA CUDA 13 support with Blackwell GPUs. Custom layer splits via GPUShardingSpec.
- CUDA 13 with Blackwell and Qwen3.5 support
- Custom layer splits from GPUShardingSpec
- 64 tok/s on Llama 3.2 3B (17x faster than CPU)
- GKE + NVIDIA GPU Operator ready
Metal Agent
Three pluggable runtime backends for Apple Silicon inference. Choose the best fit for your workflow.
- Ollama backend (200K+ users, auto model download)
- oMLX backend (MLX-native, 40% faster)
- llama-server backend (direct llama.cpp control)
- M1/M2/M3/M4/M5 with health checks and metrics
KV Cache & Advanced Config
Fine-tune llama.cpp behavior through the CRD. Configure KV cache quantization or pass any flag.
- cacheTypeK / cacheTypeV for KV cache quantization
- extraArgs escape hatch for any llama-server flag
- Reduce VRAM usage up to 4.5x with cache compression
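As a spec fragment, that might look like the following. The `cacheTypeK`, `cacheTypeV`, and `extraArgs` names are the documented fields; their nesting and the chosen values are assumptions. `q4_0` is one of llama.cpp's supported cache quantization types.

```yaml
spec:
  cacheTypeK: q4_0       # quantize the key cache to 4-bit
  cacheTypeV: q4_0       # quantize the value cache to 4-bit
  extraArgs:
    - "--flash-attn"     # any llama-server flag passes through unmodified
```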
GPU Queue Visibility
See where workloads stand in the GPU queue. Priority classes control scheduling.
- Real-time queue position in status
- GPU contention visibility
- Priority classes for scheduling control
GPU Monitoring
Prometheus and Grafana integration with DCGM GPU metrics and pre-built dashboards.
- GPU utilization, temp, power monitoring
- Pre-built Grafana dashboards
- Prometheus metrics integration
Kubernetes Native
Custom Resource Definitions for Model and InferenceService. Works with kubectl and GitOps.
- Declarative YAML configuration
- GitOps-ready deployments
- Standard K8s patterns
Automatic Model Management
Download from HuggingFace, HTTP, or PVC sources. Automatic caching and SHA256 validation.
- HuggingFace + HTTP + PVC sources
- GGUF format support
- Persistent volume caching
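A hedged sketch of a Model resource pulling from HuggingFace. The API group and field layout are assumptions; the source types and SHA256 validation come from the feature list above.

```yaml
apiVersion: llmkube.io/v1alpha1   # assumed API group
kind: Model
metadata:
  name: llama-3-2-3b
spec:
  source:
    huggingface:
      repoId: meta-llama/Llama-3.2-3B-Instruct  # illustrative repo
  format: gguf
  checksum:
    sha256: "<expected digest>"   # verified after download, before caching
```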
OpenAI Compatible
Drop-in replacement for OpenAI API. Use existing tools without code changes.
- /v1/chat/completions endpoint
- Streaming responses
- Compatible with LangChain, etc.
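Because the endpoint follows the OpenAI chat-completions wire format, any OpenAI client works by pointing its base URL at the in-cluster service. A minimal sketch of the request payload (the service hostname and model name are illustrative):

```python
import json

# Request body in the OpenAI chat-completions wire format.
url = "http://llama-serve.default.svc/v1/chat/completions"  # illustrative service
payload = {
    "model": "llama-3.2-3b",                                # illustrative model name
    "messages": [{"role": "user", "content": "Say hello."}],
    "stream": False,  # set True for server-sent-event streaming
}
body = json.dumps(payload)
# POST `body` to `url` with any HTTP client, e.g. requests.post(url, data=body)
print(json.loads(body)["messages"][0]["role"])  # -> user
```

Existing tooling (LangChain, the official `openai` SDK, curl scripts) needs only the base URL changed; no code changes to the request or response handling.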
CLI Tool
Simple command-line interface for deploying and managing LLM workloads.
- Deploy with --gpu flag
- List, status, delete commands
- macOS, Linux, Windows binaries
CLI Benchmark Suites
Five test suites for comprehensive validation with automated sweeps and markdown reports.
- llmkube benchmark --suite quick
- Concurrency, context, and token sweeps
- Markdown reports for sharing
Grafana Dashboard
Pre-built GPU observability dashboard for monitoring utilization, temperature, and memory.
- Multi-GPU monitoring with DCGM
- Import-ready JSON in config/grafana/
Persistent Model Cache
Download models once, deploy instantly across services. Reduce bandwidth and startup times.
- Per-namespace model cache PVC
- Instant model switching
- Configurable cache invalidation
Model Catalog
20+ pre-configured models. Deploy instantly with optimized settings.
- One-command deployments
- Llama, Mistral, Qwen, DeepSeek, Mixtral, Phi
- Smart defaults with override support
Multi-GPU Support
Deploy 13B-70B+ models across GPUs with automatic layer sharding.
- ~44 tok/s on Llama 13B with 2x RTX 5060 Ti
- Automatic tensor split calculation
- Layer-based sharding
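At its core, an automatic tensor split is proportional allocation: layers divided by each GPU's share of total VRAM. A simplified, illustrative version of that calculation (not the operator's actual code, which may also weigh KV-cache and activation memory):

```python
def tensor_split(total_layers: int, vram_gb: list[float]) -> list[int]:
    """Split model layers across GPUs in proportion to their VRAM."""
    total_vram = sum(vram_gb)
    # Proportional share, floored; leftover layers go to the largest GPUs.
    split = [int(total_layers * v / total_vram) for v in vram_gb]
    remainder = total_layers - sum(split)
    for i in sorted(range(len(vram_gb)), key=lambda i: -vram_gb[i])[:remainder]:
        split[i] += 1
    return split

# Two 16 GB cards, a 40-layer model: an even 20/20 split.
print(tensor_split(40, [16.0, 16.0]))  # [20, 20]
```

With mismatched cards the split skews toward the larger GPU, e.g. `tensor_split(40, [24.0, 8.0])` yields `[30, 10]`.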
Multi-Cloud Support
Deploy on any Kubernetes distribution. Standard K8s patterns.
- Tested on GKE, kind, Minikube
- Works on AKS, EKS, bare metal
- Custom tolerations and nodeSelector
Helm Chart
Production-ready chart with 50+ configurable parameters.
- helm install llmkube llmkube/llmkube
- ImagePullSecrets for private registries
- Namespace isolation, RBAC included
Validated performance
Real benchmarks from GKE deployments — CPU and GPU
CPU Baseline
TinyLlama 1.1B on GKE n2-standard-2
- Token generation: ~18.5 tok/s
- Prompt processing: ~29 tok/s
- Response time: ~1.5s (P50)
- Cold start: ~5s
GPU Accelerated
Llama 3.2 3B on GKE + NVIDIA L4
- Token generation: ~64 tok/s
- Prompt processing: ~1,026 tok/s
- Response time: ~0.6s
- GPU memory: 4.2 GB VRAM
- Power usage: ~35W
What's coming next
Multi-node sharding, SLO enforcement, and production hardening
JSON Benchmark Output
Programmatic output for CI/CD integration and automated tracking.
- JSON output format
- CI pipeline integration
- Diff comparison across runs
Runtime Extensions
Community-contributed runtimes and deeper integration with existing backends.
- Contributor guide for adding runtimes
- Runtime-specific health checks
- Unified metrics across runtimes
Multi-Node GPU Sharding
Distribute 70B+ models across multiple GPU nodes with intelligent layer scheduling and P2P KV cache sharing.
SLO Enforcement
Automatic monitoring, enforcement, and intelligent fallback mechanisms with latency-based routing.
Edge Optimization
Distribute inference across edge nodes with geo-aware placement and bandwidth-optimized routing.
Advanced Observability
Per-request cost tracking, quality monitoring, and advanced performance dashboards.
We build incrementally with production validation at each step. Roadmap subject to change based on community feedback.
Ready to get started?
Deploy your first GPU-accelerated LLM in minutes.