Getting Started

Deploy GPU-accelerated LLMs on Kubernetes in 5 minutes

5-Minute Quick Start

Try LLMKube locally on your laptop with Minikube - no cloud account required! Perfect for testing and development.

Prerequisites

  • Kubernetes cluster - Minikube, kind, GKE, EKS, or AKS (v1.11.3+)
  • kubectl installed and configured
  • Helm 3.0+ (for Helm installation method)
  • Cluster admin permissions (to install CRDs)

For local testing: We recommend Minikube with at least 4 CPUs and 8 GB of RAM; a quick start-and-verify sequence is shown below.
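
For example, assuming Minikube is installed locally:

# Start a local cluster sized for small models, then confirm the node is Ready
minikube start --cpus 4 --memory 8192
kubectl get nodes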

1. Install the LLMKube CLI

The llmkube CLI makes deployment simple - just one command to deploy any model.

Quick Install (macOS/Linux, recommended)

curl -sSL https://raw.githubusercontent.com/defilantech/LLMKube/main/install.sh | bash

The script detects your OS and architecture and downloads the latest release binary.

macOS via Homebrew

brew tap defilantech/tap
brew install llmkube

Windows Installation

Download the Windows binary from the latest release page.

Extract and add to your PATH.

Verify the installation:

llmkube version

2. Install LLMKube Operator

Install the LLMKube operator to your cluster using Helm (recommended) or Kustomize.

Option 1: Helm Chart (recommended)

# Add the Helm repository
helm repo add llmkube https://defilantech.github.io/LLMKube
helm repo update

# Install LLMKube
helm install llmkube llmkube/llmkube \
  --namespace llmkube-system \
  --create-namespace

# Verify installation
kubectl get pods -n llmkube-system

Option 2: Kustomize

# Clone and install (ensures correct image tags)
git clone https://github.com/defilantech/LLMKube.git
cd LLMKube
kubectl apply -k config/default

# Verify installation
kubectl get pods -n llmkube-system

Wait for ready: The operator should be running in the llmkube-system namespace. Wait for all pods to be in Running state before proceeding.
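
One way to wait without assuming a specific Deployment name is to target every Deployment in the namespace:

# List the operator's workloads, then block until they report Available
kubectl get deploy -n llmkube-system
kubectl wait --for=condition=Available --timeout=180s -n llmkube-system deployment --all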

3. Deploy Your First Model

Choose from the pre-configured model catalog or deploy any GGUF model from HuggingFace.

Deploy from Model Catalog (easiest)

Browse 10+ pre-configured popular models (Llama, Mistral, Qwen, DeepSeek, Phi-3, and more):

# Browse available models
llmkube catalog list

# Get details about a specific model
llmkube catalog info phi-3-mini

# Deploy with one command!
llmkube deploy phi-3-mini --cpu 500m --memory 1Gi

Deploy Custom Model

Deploy any GGUF model from HuggingFace:

llmkube deploy tinyllama \
  --source https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  --cpu 500m \
  --memory 1Gi

Monitor Deployment

# Check deployment status
llmkube status phi-3-mini

# Or use kubectl
kubectl wait --for=condition=available --timeout=300s inferenceservice/phi-3-mini

What happens: LLMKube downloads the model (~600MB-3GB), creates a Deployment with an init container for model loading, and exposes an OpenAI-compatible API endpoint.
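
To see what the operator created, you can inspect the custom resource and the workloads behind it (the names below assume the phi-3-mini example from above):

# Inspect the InferenceService and its status
kubectl get inferenceservice phi-3-mini -o yaml

# The Deployment, Service, and pods created for it
kubectl get deployment,service,pods | grep phi-3-mini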

4. Test the API

Port-forward to the service and test the OpenAI-compatible endpoint:

# Port forward to local machine
kubectl port-forward svc/phi-3-mini 8080:8080

In another terminal, send a test request:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Explain Kubernetes in one sentence"
      }
    ],
    "max_tokens": 50
  }'

Success! You should receive a JSON response with the model's completion. The API is fully OpenAI-compatible, so you can use existing SDKs and tools.
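
If you have jq installed, you can pull out just the generated text; the response body follows the standard OpenAI chat-completions schema:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words"}], "max_tokens": 30}' \
  | jq -r '.choices[0].message.content'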

5. Use with OpenAI SDK

LLMKube is a drop-in replacement for the OpenAI API. Use any OpenAI SDK or library:

from openai import OpenAI

# Point to your LLMKube service
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # LLMKube doesn't require API keys
)

# Use exactly like OpenAI API
response = client.chat.completions.create(
    model="phi-3-mini",
    messages=[
        {"role": "user", "content": "What is Kubernetes?"}
    ]
)

print(response.choices[0].message.content)

Works with: LangChain, LlamaIndex, OpenAI SDKs (Python, Node.js, Go), and any tool that supports the OpenAI API format.
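
Many clients can also be pointed at LLMKube through environment variables instead of code changes; the official OpenAI Python SDK (v1+) reads the two below, though variable names vary by tool:

export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=not-needed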

🚀 GPU-Accelerated Deployment

Get 17x faster inference with GPU acceleration on GKE, EKS, or any cluster with NVIDIA GPUs:

  • CPU baseline: ~18 tok/s
  • NVIDIA L4 GPU: ~64 tok/s

# Deploy with GPU acceleration
llmkube deploy llama-3.1-8b --gpu --gpu-count 1

# Or from catalog
llmkube catalog info llama-3.1-8b  # See GPU requirements
llmkube deploy llama-3.1-8b --gpu
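
Once the pod is scheduled onto a GPU node, you can confirm the device is visible from inside the serving container. This assumes the container is named llama-server (as in the troubleshooting section below) and that the image ships nvidia-smi:

# Find the pod, then check the GPU from inside the container
kubectl get pods
kubectl exec <pod-name> -c llama-server -- nvidia-smi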

Troubleshooting

Model stuck in "Downloading" state

Check the init container logs to see download progress:

kubectl logs <pod-name> -c model-downloader

Ensure your cluster has internet access or the model is available via the configured source URL. Large models can take several minutes to download.
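
Pod events can also reveal scheduling or image-pull problems that block the download:

kubectl describe pod <pod-name>
kubectl get events --sort-by=.lastTimestamp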

Pod crashes with OOMKilled

Increase memory allocation for the deployment:

llmkube deploy <model> --memory 4Gi

Rule of thumb: Model memory should be at least 1.2x the GGUF file size.
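
As a worked example (the file size here is purely illustrative): a 2.5 GB GGUF file needs roughly 1.2 x 2.5 GB = 3 GB, so request at least 3Gi:

# ~2.5 GB model file x 1.2 headroom ~ 3 GB -> request 3Gi
llmkube deploy <model> --memory 3Gi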

GPU not detected

Verify the NVIDIA GPU operator is running:

kubectl get pods -n gpu-operator-resources

Check that GPU nodes are labeled correctly:

kubectl get nodes -l cloud.google.com/gke-accelerator
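
The label above is GKE-specific. A more generic check is whether the node advertises GPU capacity at all, assuming the standard NVIDIA device plugin is installed:

# Look for a non-zero nvidia.com/gpu count under Allocatable
kubectl describe node <gpu-node-name> | grep -A 10 "Allocatable"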

API requests timing out

Check if the service pod is running:

kubectl get pods -l app=<model-name>

View server logs for errors:

kubectl logs <pod-name> -c llama-server

For larger models or complex prompts, you may need to increase resource allocations or adjust timeout settings.
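
If the pod is healthy but simply slow, redeploying with more CPU and memory often helps; the flags below are the same ones used earlier in this guide, with illustrative values:

llmkube deploy <model> --cpu 2000m --memory 4Gi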

For more detailed troubleshooting, see the Minikube Quickstart Guide →