About LLMKube
A Kubernetes operator that turns self-hosted LLM deployment into a two-line YAML problem.
Why LLMKube exists
Running an LLM on your own hardware is straightforward. Running it for a team is where it falls apart. Model downloads, GPU scheduling, health checks, autoscaling, observability, multi-runtime support: each piece becomes a full-time job that distracts from building the thing you actually care about.
LLMKube treats LLM inference as a first-class Kubernetes workload. Instead of bolting AI tools onto container orchestration as an afterthought, LLMKube extends Kubernetes with purpose-built CRDs for Model and InferenceService resources. The operator handles everything below the API layer so your team can focus on what they are building, not how inference is running.
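In practice, a deployment built on those two resources could look something like the sketch below. The `apiVersion`, field names, and values are illustrative assumptions, not the exact LLMKube schema:

```yaml
# Illustrative sketch only: apiVersion, fields, and values are
# assumptions, not the published LLMKube CRD schema.
apiVersion: llmkube.dev/v1alpha1
kind: Model
metadata:
  name: llama-3-8b
spec:
  source: huggingface://meta-llama/Meta-Llama-3-8B-Instruct
---
apiVersion: llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-8b-svc
spec:
  model: llama-3-8b
  runtime: llama.cpp
  replicas: 2
```

Because these are ordinary custom resources, they can be applied with `kubectl apply -f`, templated with Helm, or managed by any GitOps tool.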
Opinionated about infrastructure patterns, flexible about runtimes: vLLM for throughput, llama.cpp for efficiency, TGI for flexibility, or bring your own container. One operator, every runtime.
Project principles
The technical philosophy behind every design decision.
Kubernetes-Native, Not Kubernetes-Adjacent
LLMKube extends Kubernetes with CRDs, not wrappers around it. Your existing kubectl, Helm, GitOps, RBAC, and monitoring workflows apply without modification.
Runtime-Agnostic by Design
No single inference engine is best for every workload. LLMKube provides a pluggable backend interface so you can choose vLLM for throughput, llama.cpp for efficiency, TGI for flexibility, or bring your own container.
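A pluggable backend boundary like this is usually expressed as a small Go interface that each runtime implements. The sketch below is a hypothetical illustration of the pattern, not LLMKube's actual interface; the type and method names are assumptions:

```go
package main

import "fmt"

// RuntimeBackend is a hypothetical sketch of a pluggable runtime
// interface; the real LLMKube interface may differ.
type RuntimeBackend interface {
	// Name identifies the runtime, e.g. "vllm", "llama.cpp", "tgi".
	Name() string
	// ContainerImage returns the container image that serves the model.
	ContainerImage(model string) string
}

// llamaCppBackend is an illustrative implementation for llama.cpp.
type llamaCppBackend struct{}

func (llamaCppBackend) Name() string { return "llama.cpp" }

func (llamaCppBackend) ContainerImage(model string) string {
	// Illustrative image reference, not a pinned LLMKube default.
	return "ghcr.io/ggerganov/llama.cpp:server"
}

func main() {
	var b RuntimeBackend = llamaCppBackend{}
	fmt.Println(b.Name()) // prints "llama.cpp"
}
```

The operator only depends on the interface, so adding a runtime means shipping one more implementation rather than changing the reconciliation logic.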
Observable by Default
Every deployment ships with Prometheus metrics, Grafana dashboards, and OpenTelemetry tracing. You should never have to wonder what your inference stack is doing.
Works Where You Are
Cloud, on-prem, air-gapped, edge, or a Mac on your desk. LLMKube runs wherever Kubernetes runs, with the Metal Agent extending GPU access to Apple Silicon nodes that containers cannot reach.
Project at a glance
- Open source license: Apache 2.0
- Pluggable runtimes: 5
- Pre-configured models: 20+
- CLI commands: 10+
- Helm parameters: 50+
- GPU acceleration: CUDA + Metal
Get involved
LLMKube is built by the people who use it. Here's how to join in.
Contribute Code
LLMKube is written in Go with a Helm chart and CLI. Pick up a good-first-issue or propose a new runtime backend.
Join the Community
Ask questions, share what you are building, and connect with other LLMKube users and contributors.
Report Issues
Found a bug? Have an idea for a feature? The roadmap is shaped by community feedback. Every issue gets read.
Built in the open since 2025
LLMKube is created and maintained by Defilan Technologies LLC in Washington State. The project is Apache 2.0 licensed and free forever.