Kubernetes for Local LLMs
Deploy, manage, and scale AI inference workloads in air-gapped, edge, and hybrid environments with production-grade orchestration.
Designed for teams in defense, healthcare, and edge computing
See it in action
Deploy an LLM with familiar kubectl commands
Why LLMKube?
LLMs are stuck in notebooks. We're bringing them to production.
Before LLMKube
- ✗ Manually download models
- ✗ Write hacky Python scripts
- ✗ Pray the GPU works
- ✗ No observability
- ✗ Manual scaling
- ✗ "Works on my laptop"
After LLMKube
- ✓ Automatic model management
- ✓ Declarative CRDs
- ✓ Production-grade deployments
- ✓ Full observability (Prometheus + Grafana + GPU metrics)
- ✓ Horizontal scaling
- ✓ Works everywhere: cloud, edge, air-gapped
Kubernetes won because it made infrastructure declarative, portable, and self-healing.
LLMKube does the same for intelligence.
The paradigm shift from notebooks to production
How it works
The init container pattern: Separate model management from serving
LLMKube Architecture
💡 Innovation: Init Container Pattern
LLMKube uses Kubernetes init containers to separate model downloading from serving. This means:
- Fast cold starts: Models are cached in persistent storage
- Separation of concerns: Download logic separate from inference logic
- Kubernetes-native: Uses standard patterns that platform engineers already know
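The sketch below shows the shape of this pattern: an init container populates a persistent volume with the model file, and the serving container starts only once the download has succeeded. It is illustrative only, not the exact manifest LLMKube generates; the image names, URL, and paths are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: llm-server-example
spec:
  initContainers:
    - name: model-downloader            # runs to completion before serving starts
      image: curlimages/curl:latest     # placeholder downloader image
      command: ["sh", "-c"]
      args:
        - |
          # Skip the download if the model is already cached on the volume
          test -f /models/model.gguf || curl -L -o /models/model.gguf "$MODEL_URL"
      env:
        - name: MODEL_URL
          value: "https://example.com/model.gguf"   # placeholder URL
      volumeMounts:
        - name: model-cache
          mountPath: /models
  containers:
    - name: inference-server
      image: ghcr.io/ggml-org/llama.cpp:server      # placeholder serving image
      args: ["-m", "/models/model.gguf", "--port", "8080"]
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: model-cache
          mountPath: /models
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-cache-pvc      # persistent cache keeps later cold starts fast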
Production-validated performance
Real benchmarks from GKE deployments - CPU and GPU
CPU Baseline
Cost-Effective
Configuration
- Model: TinyLlama 1.1B Q4_K_M
- Platform: GKE n2-standard-2
- Model Size: 637.8 MiB
Performance
- Token Generation: ~18.5 tok/s
- Prompt Processing: ~29 tok/s
- Response Time: ~1.5s (P50)
- Cold Start: ~5s
GPU Accelerated
17x Faster
Configuration
- Model: Llama 3.2 3B Q8_0
- Platform: GKE + NVIDIA L4 GPU
- Model Size: 3.2 GiB
- GPU Layers: 29/29 (100%)
Performance
- Token Generation: ~64 tok/s
- Prompt Processing: ~1,026 tok/s
- Response Time: ~0.6s
- GPU Memory: 4.2 GB VRAM
- Power Usage: ~35W
- Temperature: 56-58°C
Production ready: GPU acceleration delivers 17x faster inference with full observability (Prometheus + Grafana + DCGM metrics). Both CPU and GPU deployments ship with comprehensive monitoring and SLO alerts.
Production-grade LLM orchestration
Everything you need to run local AI workloads at scale, with the operational rigor of modern microservices
GPU Acceleration
Production-ready GPU inference with NVIDIA CUDA support. Achieve 17x faster inference with automatic GPU layer offloading.
Kubernetes Native
Deploy LLMs using familiar kubectl commands and YAML manifests. Integrates seamlessly with your existing K8s infrastructure and GitOps workflows.
Air-Gap Ready
Run LLMs in completely disconnected environments. Perfect for defense, healthcare, and regulated industries that can't use cloud APIs.
Built-in Observability
Full observability stack with Prometheus, Grafana dashboards, DCGM GPU metrics, and automated SLO alerts for production monitoring.
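As one example of what that looks like in practice, GPU utilization can be pulled straight from Prometheus using the standard dcgm-exporter metrics. The namespace and service name below are assumptions about a typical kube-prometheus-stack install, so adjust them to your setup.
# Illustrative only: namespace and service name depend on how the monitoring stack is installed.
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &

# DCGM_FI_DEV_GPU_UTIL is a standard dcgm-exporter metric; query it via the Prometheus HTTP API.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL'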
Edge Optimized
Distribute inference across edge nodes with intelligent scheduling. Model sharding for efficient resource utilization.
SLO Enforcement
Define and enforce latency targets, quality thresholds, and performance SLAs. Self-healing controllers keep your services running.
OpenAI Compatible
Drop-in replacement for OpenAI API endpoints. Use existing tools and libraries without code changes.
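For instance, once a model is serving, a standard OpenAI-style chat completion request works against the in-cluster endpoint. The hostname and port here are illustrative, not LLMKube defaults.
# Replace the host and port with your deployed service endpoint.
curl -s http://phi-3-mini.default.svc.cluster.local:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "phi-3-mini",
        "messages": [{"role": "user", "content": "Summarize Kubernetes in one sentence."}]
      }'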
Deploy an LLM in seconds
Simple, declarative YAML that feels native to Kubernetes developers
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: phi-3-mini
spec:
  source: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf
  format: gguf
  quantization: Q4_K_M
  hardware:
    accelerator: cuda
    gpu:
      enabled: true
      count: 1
  resources:
    cpu: "2"
    memory: "4Gi"
Built for real-world deployments
From air-gapped data centers to distributed edge networks
Defense & Government
Deploy classified AI assistants in SCIF environments with TEE support and attestation. Meet CMMC and FedRAMP requirements.
- Air-gapped deployment
- Complete audit logs
- Zero external dependencies
Healthcare & Life Sciences
Run HIPAA-compliant AI assistants for clinical decision support without sending PHI to external APIs.
- HIPAA compliant
- PII detection with eBPF
- On-premises deployment
Manufacturing & Industrial
Deploy AI assistants on factory floors with poor connectivity. Distributed inference across edge devices.
- Offline-first operation
- Low-latency edge inference
- Resource-constrained optimization
Financial Services
Run AI models in regulated environments with complete data sovereignty and compliance guarantees.
- Data sovereignty
- Compliance-ready logging
- High availability SLAs
Ready to deploy your first LLM?
Join the growing community of developers and enterprises running production AI workloads on LLMKube
Open source and free forever • Enterprise support coming soon