Run production LLMs for $0.01 per job
We analyzed 200 GitHub issues with a 14B model on two $400 GPUs. Total cost: one cent. LLMKube makes self-hosted inference actually work on hardware you can afford.
Self-hosted on your hardware • Kubernetes-native
See it in action
Deploy GPU-accelerated LLMs in seconds with the llmkube CLI
Why LLMKube?
Local LLMs are great for prototyping. Scaling them for a team is where it gets hard.
The Scaling Challenge
- ✗ Silent failures with no alerts
- ✗ Multi-GPU memory math by trial and error
- ✗ Updates that break your setup
- ✗ Docker Compose that doesn't scale
- ✗ One person managing everything
- ✗ Every machine set up by hand
With LLMKube
- ✓ Health checks that actually tell you when things break
- ✓ GPU layer offloading with automatic configuration
- ✓ Helm-pinned versions that don't break on update
- ✓ Infrastructure as code, not scripts and duct tape
- ✓ Your whole team can deploy and manage
- ✓ Prometheus + Grafana integration for GPU monitoring
Ollama for dev. vLLM for speed. LLMKube for Kubernetes.
The platform layer your inference engine is missing
How it works
The init container pattern: Separate model management from serving
LLMKube Architecture
💡 Innovation: Init Container Pattern
LLMKube uses Kubernetes init containers to separate model downloading from serving, as sketched after this list. This means:
- Fast cold starts: Models are cached in persistent storage
- Separation of concerns: Download logic separate from inference logic
- Kubernetes-native: Uses standard patterns that platform engineers already know
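To make the pattern concrete, here is a minimal sketch using plain Kubernetes primitives. The image names, model URL, cache path, and PVC name are illustrative assumptions, not the manifest LLMKube actually generates:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-server
spec:
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-cache            # assumes a PVC exists; download once, reuse on restarts
  initContainers:
    - name: model-downloader
      image: curlimages/curl:8.8.0        # assumption: any image with curl works here
      command: ["sh", "-c"]
      args:
        - |
          # skip the download if the model is already cached on the volume
          [ -f /models/model.gguf ] || \
            curl -L -o /models/model.gguf "$MODEL_URL"
      env:
        - name: MODEL_URL
          value: https://example.com/model.gguf   # hypothetical model source
      volumeMounts:
        - name: model-cache
          mountPath: /models
  containers:
    - name: inference
      image: ghcr.io/ggml-org/llama.cpp:server    # assumption: any llama.cpp server image
      args: ["-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"]
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: model-cache
          mountPath: /models
```

Because the cache lives on a PersistentVolumeClaim, restarts skip the download entirely, which is where the fast cold starts come from.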
Production-validated performance
Real benchmarks from GKE deployments - CPU and GPU
CPU Baseline (Cost-Effective)
Configuration
- Model: TinyLlama 1.1B Q4_K_M
- Platform: GKE n2-standard-2
- Model Size: 637.8 MiB
Performance
- Token Generation: ~18.5 tok/s
- Prompt Processing: ~29 tok/s
- Response Time: ~1.5s (P50)
- Cold Start: ~5s
GPU Accelerated
Configuration
- Model: Llama 3.2 3B Q8_0
- Platform: GKE + NVIDIA L4 GPU
- Model Size: 3.2 GiB
- GPU Layers: 29/29 (100%)
Performance
- Token Generation: ~64 tok/s
- Prompt Processing: ~1,026 tok/s
- Response Time: ~0.6s
- GPU Memory: 4.2 GB VRAM
- Power Usage: ~35W
- Temperature: 56-58°C
GPU vs CPU Performance: In these benchmarks the NVIDIA L4 deployment generates tokens roughly 3.5x faster and processes prompts roughly 35x faster than the CPU baseline (note the two configurations run different models: Llama 3.2 3B vs TinyLlama 1.1B). Both CPU and GPU deployments include Prometheus and Grafana monitoring integration.
Production-grade LLM orchestration
Everything you need to run local AI workloads at scale, with the operational rigor of modern microservices
Reproducible Deployments
Helm chart with pinned versions. No more "update broke my setup." Deploy the same config across dev, staging, and prod.
Local Testing with Metal
Test GPU-accelerated inference locally on Apple Silicon Macs with Metal support. Run on Minikube in under 10 minutes.
GPU Acceleration
Production-ready GPU inference with NVIDIA CUDA support. Achieve 17x faster inference with automatic GPU layer offloading.
GPU Queue Visibility
See exactly where your workloads stand. Queue position and GPU contention visible in kubectl status. Priority classes control scheduling.
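For context, priority classes are standard Kubernetes objects that a latency-sensitive inference workload can reference via `priorityClassName`. A minimal sketch (the name, value, and description here are arbitrary, not defaults shipped by LLMKube):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-high          # arbitrary name; reference it from the workload's priorityClassName
value: 100000                   # higher value = scheduled ahead of lower-priority GPU jobs
globalDefault: false
description: "High priority for latency-sensitive LLM inference pods"
```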
CLI Benchmarks
Validate deployments before production. Five test suites, automated sweeps, and markdown reports you can share with your team.
Grafana Dashboard
Pre-built GPU observability dashboard. Track utilization, temperature, memory, and inference latency across your cluster.
Multi-GPU Without the Math
Stop guessing at VRAM. Deploy 70B+ models across GPUs with automatic layer sharding. No trial-and-error memory tuning.
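Extrapolating from the Model spec shown further down, a multi-GPU deployment mostly means raising the GPU count; the model URL below is a placeholder and the exact sharding fields may differ:

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: llama-70b
spec:
  source: https://example.com/llama-70b-q4.gguf   # placeholder URL
  format: gguf
  quantization: Q4_K_M
  hardware:
    accelerator: cuda
    gpu:
      enabled: true
      count: 4        # layers are sharded across all four GPUs automatically
```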
Kubernetes Native
Deploy LLMs using familiar kubectl commands and YAML manifests. Integrates seamlessly with your existing K8s infrastructure and GitOps workflows.
On-Premise Friendly
Designed for on-premise deployment. Load models from local files or private registries. Your data never leaves your cluster.
Know When Models Die
No more silent failures. Prometheus alerts, Grafana dashboards, and DCGM GPU metrics tell you something's wrong before users notice.
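As one concrete example of the kind of alert this enables, here is a sketch of a Prometheus Operator `PrometheusRule` built on the DCGM exporter's `DCGM_FI_DEV_GPU_UTIL` metric. The rule name, threshold, and labels are illustrative, not shipped defaults:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-gpu-alerts          # illustrative name
  labels:
    release: prometheus         # assumption: matches your Prometheus Operator's rule selector
spec:
  groups:
    - name: llm-inference
      rules:
        - alert: GPUInferenceStalled
          # GPU has sat idle for 10 minutes while inference pods are scheduled on it,
          # a common symptom of a silently crashed model server.
          expr: avg by (gpu) (DCGM_FI_DEV_GPU_UTIL) < 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} idle; model server may have died"
```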
Edge Optimized
Distribute inference across edge nodes with intelligent scheduling. Model sharding for efficient resource utilization.
SLO Enforcement
Define and enforce latency targets, quality thresholds, and performance SLAs. Self-healing controllers keep your services running.
OpenAI Compatible
Drop-in replacement for OpenAI API endpoints. Use existing tools and libraries without code changes.
Deploy an LLM in seconds
Simple, declarative YAML that feels native to Kubernetes developers
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: phi-3-mini
spec:
  source: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf
  format: gguf
  quantization: Q4_K_M
  hardware:
    accelerator: cuda
    gpu:
      enabled: true
      count: 1
  resources:
    cpu: "2"
    memory: "4Gi"
Built for platform engineers
Deploy and manage LLM workloads on Kubernetes with tools you already know
Development Teams
Give your engineering team access to local LLMs without managing YAML. Deploy from a catalog of pre-configured models with one CLI command.
- Model catalog with 10+ models
- One-command deployment
- OpenAI-compatible API
Cost Optimization
Self-hosted inference at a fraction of cloud API costs. Run models on your existing GPU hardware with Kubernetes orchestration.
- $0.01/job vs $2.50+/M tokens (cloud APIs)
- Persistent model cache (download once)
- Multi-GPU support for larger models
Mac to Production
Develop and test on Apple Silicon with Metal acceleration. Deploy to Kubernetes with NVIDIA CUDA using the same CLI and CRDs.
- Apple M1/M2/M3/M4 Metal support
- Same CLI for local and cluster
- Unique — no other K8s operator supports Metal
Multi-Model Serving
Run multiple LLMs on your GPU cluster. Model catalog, persistent caching, and GPU queue visibility keep things manageable.
- GPU queue with priority scheduling
- Per-namespace model cache isolation
- CLI for status, benchmarks, and management
Early Adopter Program
Help shape the future of LLMKube and get direct access to the maintainer.
What You Get
- Private Discord with other early adopters
- Direct input on the roadmap
- Your logo on our website (when ready)
- Early access to new features
What We Need
- Real-world feedback on your use case
- 30 minutes monthly for a feedback call
- Permission to share your story (anonymized if needed)
Apply to Join
Ready to deploy your first LLM?
Join the community of developers deploying LLMs on Kubernetes
Open source and free forever • Enterprise support coming soon