About LLMKube
A Kubernetes operator that turns self-hosted LLM deployment into a two-line YAML problem.
Why LLMKube exists
Running an LLM on your own hardware is straightforward. Running it for a team is where it falls apart. Model downloads, GPU scheduling, health checks, autoscaling, observability, multi-runtime support: each piece becomes a full-time job that distracts from building the thing you actually care about.
LLMKube treats LLM inference as a first-class Kubernetes workload. Instead of bolting AI tools onto container orchestration as an afterthought, LLMKube extends Kubernetes with purpose-built CRDs for Model and InferenceService resources. The operator handles everything below the API layer so your team can focus on what they are building, not how inference is running.
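In practice, a deployment built on those two resources could look something like the sketch below. The `apiVersion`, field names, and values are illustrative assumptions, not the exact LLMKube schema:

```yaml
# Illustrative sketch only: apiVersion, fields, and values are
# assumptions, not the published LLMKube CRD schema.
apiVersion: llmkube.dev/v1alpha1
kind: Model
metadata:
  name: llama-3-8b
spec:
  source: huggingface://meta-llama/Meta-Llama-3-8B-Instruct
---
apiVersion: llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-8b-svc
spec:
  model: llama-3-8b
  runtime: llama.cpp
  replicas: 2
```

Because these are ordinary custom resources, they can be applied with `kubectl apply -f`, templated with Helm, or managed by any GitOps tool.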
Opinionated about infrastructure patterns, flexible about runtimes: vLLM for throughput, llama.cpp for efficiency, TGI for flexibility, or bring your own container. One operator, every runtime.
Project principles
The technical philosophy behind every design decision.
Kubernetes-Native, Not Kubernetes-Adjacent
LLMKube extends Kubernetes with CRDs, not wrappers around it. Your existing kubectl, Helm, GitOps, RBAC, and monitoring workflows apply without modification.
Runtime-Agnostic by Design
No single inference engine is best for every workload. LLMKube provides a pluggable backend interface so you can choose vLLM for throughput, llama.cpp for efficiency, TGI for flexibility, or bring your own container.
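A pluggable backend boundary like this is usually expressed as a small Go interface that each runtime implements. The sketch below is a hypothetical illustration of the pattern, not LLMKube's actual interface; the type and method names are assumptions:

```go
package main

import "fmt"

// RuntimeBackend is a hypothetical sketch of a pluggable runtime
// interface; the real LLMKube interface may differ.
type RuntimeBackend interface {
	// Name identifies the runtime, e.g. "vllm", "llama.cpp", "tgi".
	Name() string
	// ContainerImage returns the container image that serves the model.
	ContainerImage(model string) string
}

// llamaCppBackend is an illustrative implementation for llama.cpp.
type llamaCppBackend struct{}

func (llamaCppBackend) Name() string { return "llama.cpp" }

func (llamaCppBackend) ContainerImage(model string) string {
	// Illustrative image reference, not a pinned LLMKube default.
	return "ghcr.io/ggerganov/llama.cpp:server"
}

func main() {
	var b RuntimeBackend = llamaCppBackend{}
	fmt.Println(b.Name()) // prints "llama.cpp"
}
```

The operator only depends on the interface, so adding a runtime means shipping one more implementation rather than changing the reconciliation logic.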
Observable by Default
Every deployment ships with Prometheus metrics, Grafana dashboards, and OpenTelemetry tracing. You should never have to wonder what your inference stack is doing.
Works Where You Are
Cloud, on-prem, air-gapped, edge, or a Mac on your desk. LLMKube runs wherever Kubernetes runs, with the Metal Agent extending GPU access to Apple Silicon nodes that containers cannot reach.
Project at a glance
- Open source license: Apache 2.0
- Pluggable runtimes: 5
- Pre-configured models: 20+
- CLI commands: 10+
- Helm parameters: 50+
- GPU acceleration: CUDA + Metal
Get involved
LLMKube is built by the people who use it. Here's how to join in.
Contribute Code
LLMKube is written in Go with a Helm chart and CLI. Pick up a good-first-issue or propose a new runtime backend.
Join the Community
Ask questions, share what you are building, and connect with other LLMKube users and contributors.
Report Issues
Found a bug? Have an idea for a feature? The roadmap is shaped by community feedback. Every issue gets read.
Built in the open since 2025
LLMKube is created and maintained by Defilan Technologies LLC in Washington State. The project is Apache 2.0 licensed and free forever.