Getting Started
Deploy GPU-accelerated LLMs on Kubernetes in 5 minutes
5-Minute Quick Start
Try LLMKube locally on your laptop with Minikube - no cloud account required! Perfect for testing and development.
Prerequisites
- Kubernetes cluster - Minikube, kind, GKE, EKS, or AKS (v1.11.3+)
- kubectl installed and configured
- Helm 3.0+ (for Helm installation method)
- Cluster admin permissions (to install CRDs)
For local testing: We recommend Minikube with at least 4 CPUs and 8GB RAM. Start with:
minikube start --cpus 4 --memory 8192
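If you want to confirm the tooling is in place before installing anything, a quick sanity check looks like this (assumes kubectl and Helm are already on your PATH):
# Sanity-check client tools and cluster connectivity
kubectl version --client
helm version --short
kubectl cluster-info   # should print the control plane address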
1. Install the LLMKube CLI
The llmkube CLI makes deployment simple - just one command to deploy any model.
Quick Install (macOS/Linux) - Recommended
curl -sSL https://raw.githubusercontent.com/defilantech/LLMKube/main/install.sh | bash
Automatically detects your OS and architecture, downloads the latest release.
macOS via Homebrew
brew tap defilantech/tap
brew install llmkube
Windows Installation
Download the Windows binary from the latest release page.
Extract and add to your PATH.
# Verify installation:
llmkube version
2. Install LLMKube Operator
Install the LLMKube operator to your cluster using Helm (recommended) or Kustomize.
Option 1: Helm Chart - Recommended
# Add the Helm repository
helm repo add llmkube https://defilantech.github.io/LLMKube
helm repo update
# Install LLMKube
helm install llmkube llmkube/llmkube \
--namespace llmkube-system \
--create-namespace
# Verify installation
kubectl get pods -n llmkube-system
Option 2: Kustomize
# Clone and install (ensures correct image tags)
git clone https://github.com/defilantech/LLMKube.git
cd LLMKube
kubectl apply -k config/default
# Verify installation
kubectl get pods -n llmkube-system
Wait for ready: The operator should be running in the llmkube-system namespace. Wait for all pods to be in Running state before proceeding.
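If you'd rather block until the operator is up instead of polling, kubectl wait can do it; this sketch assumes all pods in the namespace belong to the operator:
# Block until all operator pods report Ready (adjust the timeout for slow clusters)
kubectl wait --for=condition=Ready pods --all -n llmkube-system --timeout=180s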
3. Deploy Your First Model
Choose from the pre-configured model catalog or deploy any GGUF model from HuggingFace.
Deploy from Model Catalog - Easiest
Browse 10+ pre-configured popular models (Llama, Mistral, Qwen, DeepSeek, Phi-3, and more):
# Browse available models
llmkube catalog list
# Get details about a specific model
llmkube catalog info phi-3-mini
# Deploy with one command!
llmkube deploy phi-3-mini --cpu 500m --memory 1Gi
Deploy Custom Model
Deploy any GGUF model from HuggingFace:
llmkube deploy tinyllama \
--source https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--cpu 500m \
--memory 1Gi
Monitor Deployment
# Check deployment status
llmkube status phi-3-mini
# Or use kubectl
kubectl wait --for=condition=available --timeout=300s inferenceservice/phi-3-mini
What happens: LLMKube downloads the model (~600MB-3GB), creates a Deployment with an init container for model loading, and exposes an OpenAI-compatible API endpoint.
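To see the pieces described above for yourself, you can list the resources the operator created; this sketch assumes the generated Deployment and Service share the model's name:
# Inspect the resources created for the model
kubectl get inferenceservice phi-3-mini
kubectl get deployment,svc,pods | grep phi-3-mini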
4. Test the API
Port-forward to the service and test the OpenAI-compatible endpoint:
# Port forward to local machine
kubectl port-forward svc/phi-3-mini 8080:8080
In another terminal, send a test request:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{
"role": "user",
"content": "Explain Kubernetes in one sentence"
}
],
"max_tokens": 50
}'
Success! You should receive a JSON response with the model's completion. The API is fully OpenAI-compatible, so you can use existing SDKs and tools.
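If you have jq installed, you can pull out just the completion text. This assumes the standard OpenAI response shape with a choices array, which is what an OpenAI-compatible endpoint returns:
# Extract only the assistant's reply (requires jq)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 20}' \
  | jq -r '.choices[0].message.content'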
5. Use with OpenAI SDK
LLMKube is a drop-in replacement for the OpenAI API. Use any OpenAI SDK or library:
from openai import OpenAI
# Point to your LLMKube service
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed" # LLMKube doesn't require API keys
)
# Use exactly like OpenAI API
response = client.chat.completions.create(
model="phi-3-mini",
messages=[
{"role": "user", "content": "What is Kubernetes?"}
]
)
print(response.choices[0].message.content)
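If you're unsure which model name to pass to the SDK, most OpenAI-compatible servers also expose the standard /v1/models listing endpoint; whether it's available depends on the backing server:
# List the model IDs the server reports (requires jq; port-forward from step 4 still running)
curl -s http://localhost:8080/v1/models | jq '.data[].id'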
🚀 GPU-Accelerated Deployment
Get 17x faster inference with GPU acceleration on GKE, EKS, or any cluster with NVIDIA GPUs:
# Deploy with GPU acceleration
llmkube deploy llama-3.1-8b --gpu --gpu-count 1
# Or from catalog
llmkube catalog info llama-3.1-8b # See GPU requirements
llmkube deploy llama-3.1-8b --gpu
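To confirm the pod actually requested and received a GPU, a couple of checks along these lines help (the app label follows the model name, as in the troubleshooting section below):
# Show GPU capacity the scheduler can see on each node
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
# Confirm the model pod requested a GPU resource
kubectl describe pod -l app=llama-3.1-8b | grep -i 'nvidia.com/gpu'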
Next Steps
Minikube Quickstart
Detailed guide for running LLMKube locally on your laptop with Minikube - includes troubleshooting and optimization tips.
GPU Setup Guide
Deploy GPU-accelerated clusters on GKE with Terraform configs included. Get 17x faster inference with NVIDIA GPUs.
Explore GitHub
Browse examples, read full documentation, and explore the source code. See Terraform configs for production deployments.
Join the Community
Connect with other LLMKube users, ask questions, and share your experiences on GitHub Discussions.
Troubleshooting
Model stuck in "Downloading" state
Check the init container logs to see download progress:
kubectl logs <pod-name> -c model-downloader
Ensure your cluster has internet access or the model is available via the configured source URL. Large models can take several minutes to download.
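To watch progress live and catch pull or networking errors early, something like this helps:
# Follow the download in real time, and check recent events for errors
kubectl logs -f <pod-name> -c model-downloader
kubectl get events --sort-by=.metadata.creationTimestamp | tail -n 20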
Pod crashes with OOMKilled
Increase memory allocation for the deployment:
llmkube deploy <model> --memory 4Gi
Rule of thumb: Model memory should be at least 1.2x the GGUF file size.
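As a worked example of the 1.2x rule, you can compute the request straight from the downloaded file (the path is the TinyLlama file from step 3; stat flags differ on macOS, where it's stat -f%z):
# Estimate a safe memory request (~1.2x the GGUF size) from the local file
SIZE_BYTES=$(stat -c%s tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf)   # GNU/Linux stat
awk -v b="$SIZE_BYTES" 'BEGIN { printf "request at least %.1fGi\n", b * 1.2 / 1024^3 }'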
GPU not detected
Verify the NVIDIA GPU operator is running:
kubectl get pods -n gpu-operator-resources
Check that GPU nodes are labeled correctly:
kubectl get nodes -l cloud.google.com/gke-accelerator
API requests timing out
Check if the service pod is running:
kubectl get pods -l app=<model-name>
View server logs for errors:
kubectl logs <pod-name> -c llama-server
For larger models or complex prompts, you may need to increase resource allocations or adjust timeout settings.
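To see whether the problem is slow inference or a dead endpoint, curl's timing output gives a quick read (uses the port-forward from step 4):
# Measure end-to-end latency of a tiny request
curl -s -o /dev/null -w 'total: %{time_total}s\n' \
  http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"hi"}],"max_tokens":5}'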