v0.6.0 Open Source · Kubernetes Native · Apache 2.0

About LLMKube

A Kubernetes operator that turns self-hosted LLM deployment into a two-line YAML problem.

Why LLMKube exists

Running an LLM on your own hardware is straightforward. Running it for a team is where things fall apart: model downloads, GPU scheduling, health checks, autoscaling, observability, multi-runtime support. Keeping all of that healthy becomes a full-time job that distracts from building the thing you actually care about.

LLMKube treats LLM inference as a first-class Kubernetes workload. Instead of bolting AI tools onto container orchestration as an afterthought, LLMKube extends Kubernetes with purpose-built CRDs for Model and InferenceService resources. The operator handles everything below the API layer so your team can focus on what they are building, not how inference is running.
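To make the CRD idea concrete, here is a minimal sketch of what a Model plus InferenceService pair could look like. The field names and apiVersion below are illustrative assumptions, not the authoritative LLMKube schema; check the project's reference documentation for the real spec.

```yaml
# Illustrative sketch only: apiVersion, field names, and values below
# are assumptions, not the documented LLMKube CRD schema.
apiVersion: llmkube.dev/v1alpha1
kind: Model
metadata:
  name: llama-3-8b
spec:
  source: hf://meta-llama/Meta-Llama-3-8B-Instruct   # hypothetical source URI
---
apiVersion: llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-8b-svc
spec:
  modelRef: llama-3-8b   # binds to the Model above
  runtime: llama.cpp     # or vllm / tgi, per the pluggable backends
  replicas: 2
```

Once resources like these are applied, the operator reconciles everything below the API layer (pulling weights, scheduling onto GPUs, wiring health checks) so the manifests stay the single source of truth.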

Opinionated about infrastructure patterns. Flexible about runtime choices. One operator, every runtime.

Project principles

The technical philosophy behind every design decision.

Kubernetes-Native, Not Kubernetes-Adjacent

LLMKube extends Kubernetes with CRDs, not wrappers around it. Your existing kubectl, Helm, GitOps, RBAC, and monitoring workflows apply without modification.

Runtime-Agnostic by Design

No single inference engine is best for every workload. LLMKube provides a pluggable backend interface so you can choose vLLM for throughput, llama.cpp for efficiency, TGI for flexibility, or bring your own container.

Observable by Default

Every deployment ships with Prometheus metrics, Grafana dashboards, and OpenTelemetry tracing. You should never have to wonder what your inference stack is doing.

Works Where You Are

Cloud, on-prem, air-gapped, edge, or a Mac on your desk. LLMKube runs wherever Kubernetes runs, with the Metal Agent extending GPU access to Apple Silicon nodes that containers cannot reach.

Project at a glance

Apache 2.0 · Open Source License
5 · Pluggable Runtimes
20+ · Pre-Configured Models
10+ · CLI Commands
50+ · Helm Parameters
CUDA + Metal · GPU Acceleration
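Since the chart exposes dozens of parameters, a values file is the natural way to pin a deployment's configuration. The excerpt below is a hypothetical sketch; these key names are assumptions for illustration, not the chart's documented parameters.

```yaml
# Hypothetical values.yaml excerpt; key names are illustrative,
# not the chart's documented parameters.
operator:
  replicas: 1
metrics:
  enabled: true      # Prometheus metrics ship by default
gpu:
  vendor: nvidia     # CUDA today; Apple Silicon via the Metal Agent
```

Keeping a file like this in version control is what lets the GitOps workflows mentioned above manage inference the same way as any other Kubernetes workload.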

Get involved

LLMKube is built by the people who use it. Here's how to join in.

Contribute Code

LLMKube is written in Go with a Helm chart and CLI. Pick up a good-first-issue or propose a new runtime backend.

Join the Community

Ask questions, share what you are building, and connect with other LLMKube users and contributors.

Report Issues

Found a bug? Have an idea for a feature? The roadmap is shaped by community feedback. Every issue gets read.

Built in the open since 2025

LLMKube is created and maintained by Defilan Technologies LLC in Washington State. The project is Apache 2.0 licensed and free forever.

LLMKube

Kubernetes for Local LLMs. Deploy, manage, and scale AI inference workloads with production-grade orchestration.

© 2026 Defilan Technologies LLC

Community

Built for the Kubernetes and AI communities

LLMKube is not affiliated with or endorsed by the Cloud Native Computing Foundation or the Kubernetes project. Kubernetes® is a registered trademark of The Linux Foundation.