Introducing LLMKube: Kubernetes for Local LLMs
Today, we're excited to introduce LLMKube, an open-source Kubernetes operator that brings production-grade orchestration to local LLM deployments. If you've ever struggled to deploy AI workloads in air-gapped environments, edge locations, or regulated industries, LLMKube is for you.
The Problem
The AI revolution has a connectivity problem. While cloud-based LLM APIs like OpenAI and Anthropic are incredible, they're off-limits for huge swaths of the economy:
- Defense and Government: Classified environments can't send data to external APIs
- Healthcare: HIPAA compliance makes cloud APIs risky for protected health information (PHI)
- Manufacturing: Factory floors often have poor or no connectivity
- Financial Services: Data sovereignty requirements prohibit external processing
Local LLMs solve the connectivity problem, but create a new one: how do you run them in production? Most teams resort to fragile Jupyter notebooks or hand-rolled deployment scripts. There's no standardization, no observability, and no way to enforce SLOs.
The Solution: Treat Intelligence as a Workload
We realized that AI inference isn't fundamentally different from any other workload. It needs:
- Declarative configuration
- Automated deployment and scaling
- Health checks and self-healing
- Observability and metrics
- SLO enforcement
- Security and compliance features
Sound familiar? That's exactly what Kubernetes provides for microservices. So we built LLMKube as a Kubernetes operator that extends the platform with AI-specific primitives.
How It Works
LLMKube introduces two Custom Resource Definitions (CRDs):
1. Model
Define which LLM you want to use:
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: phi-3-mini
spec:
  source: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/...
  hardware:
    accelerator: cuda
```
2. InferenceService
Deploy an inference endpoint:
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: phi-3-inference
spec:
  modelRef: phi-3-mini
  replicas: 3
```
That's it. LLMKube handles downloading the model, creating Kubernetes Deployments, exposing Services, and providing OpenAI-compatible API endpoints.
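Because the endpoint speaks the OpenAI API, existing client libraries can point at it unchanged. Here's a minimal sketch of what that could look like from inside the cluster; the Service hostname, port, and model identifier are illustrative assumptions rather than LLMKube defaults, so adjust them to whatever your deployment actually exposes.

```python
# Sketch: querying the OpenAI-compatible endpoint exposed by an InferenceService.
# The base_url and model name below are assumptions for illustration, not
# LLMKube defaults -- substitute your cluster's actual Service address and port.
from openai import OpenAI

client = OpenAI(
    base_url="http://phi-3-inference.default.svc.cluster.local:8080/v1",  # assumed in-cluster Service address
    api_key="not-needed",  # local endpoints typically ignore the API key
)

response = client.chat.completions.create(
    model="phi-3-mini",  # assumed model identifier
    messages=[{"role": "user", "content": "Summarize today's shift report in three bullet points."}],
)
print(response.choices[0].message.content)
```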
What's Next
This is just the beginning. Our roadmap includes:
- SLO Enforcement: Automatic scaling and failover to meet latency targets
- Model Sharding: Distribute large models across multiple nodes
- eBPF Observability: Deep token-level tracing and PII detection
- TEE Support: Secure enclaves for sensitive workloads
- Natural Language Deployment: "Deploy a 70B model with P99 latency under 2s"
Get Started
LLMKube is open source (Apache 2.0) and ready to try today.
We'd love to hear your feedback and learn about your use cases. Let's build the future of AI infrastructure together.
About Defilan Technologies: We're building production-grade infrastructure for local AI deployment. LLMKube is our first product, designed for organizations that need on-premises AI with the same operational rigor as their microservices.