v0.7.0 Hybrid GPU/CPU Offloading · Runtime Controls · HF Repo ID Support

Production-Ready LLM Infrastructure

vLLM, TGI, llama.cpp, or bring your own — one operator for every runtime.

How it works

The init container pattern: separate model management from serving

[Architecture diagram: control plane (Kubernetes) above, data plane (workload) below, connected by apply, watch, mount, and route flows]
How to read this

Two lanes: control plane above, data plane below. The Model Controller downloads weights once per namespace into a shared cache. The InferenceService Controller creates pods whose init container mounts that cache before the runtime starts. Cold-starting a new replica is a pod boot — no re-download.


Init Container Pattern

Fast cold starts
Models cached in persistent storage
Separation of concerns
Download logic separate from inference
Kubernetes-native
Standard patterns engineers already know

Available Now

What's working today

Production-validated features ready for your deployments

Hybrid GPU/CPU Offloading

New v0.7.0

Run 30B+ MoE models on consumer GPUs: expert weights live in system RAM, the active path stays on GPU, and host memory requests are scheduler-aware.

  • moeCPUOffload, moeCPULayers, noKvOffload fields
  • hostMemory request for scheduler placement
  • Tensor overrides and batch-size controls
  • Seven new runtime controls for llama.cpp and vLLM
  • HuggingFace repo ID source for runtime-resolved models
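
The offloading controls above might look like this on an InferenceService. This is a hypothetical sketch: the field names (moeCPUOffload, moeCPULayers, noKvOffload, hostMemory) come from the release notes, but the API group, version, and exact placement within the spec are assumptions; check the CRD reference for the real schema.

```yaml
apiVersion: llmkube.io/v1alpha1   # assumed group/version
kind: InferenceService
metadata:
  name: qwen3-30b-moe
spec:
  model: qwen3-30b-moe            # assumed reference to a Model resource
  runtime: llama-cpp              # assumed runtime identifier
  moeCPUOffload: true             # keep expert weights in system RAM
  moeCPULayers: 24                # number of MoE layers offloaded to CPU
  noKvOffload: true               # pin the KV cache on the GPU
  resources:
    hostMemory: 64Gi              # scheduler-aware host memory request
```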

Pluggable Runtime Backends

v0.6.0

Choose the best inference engine for your workload. One CRD, five runtimes.

  • vLLM with PagedAttention and tensor parallelism
  • TGI (HuggingFace Text Generation Inference)
  • llama.cpp with GGUF support and a low memory footprint
  • PersonaPlex (Moshi) for speech-to-speech
  • Generic runtime for custom containers
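
Swapping engines should be a one-line change in the same CRD. A hypothetical sketch: the runtime field name and its values are assumptions, while the five backends themselves are from the list above.

```yaml
apiVersion: llmkube.io/v1alpha1   # assumed group/version
kind: InferenceService
metadata:
  name: llama-3-2-3b
spec:
  model: llama-3-2-3b
  runtime: vllm    # assumed values: vllm, tgi, llama-cpp, personaplex, generic
```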

HPA Autoscaling

v0.6.0

Scale inference replicas automatically based on real metrics. Per-runtime metric defaults.

  • Kubernetes HPA with custom inference metrics
  • Configurable min/max replicas and target values
  • Per-runtime default metrics via HPAMetricProvider
  • Works with Prometheus Adapter
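
A hedged sketch of what the autoscaling stanza could look like. Every field name below is an assumption illustrating "configurable min/max replicas and target values"; only HPAMetricProvider and the Prometheus Adapter dependency are named in the docs.

```yaml
spec:
  autoscaling:
    minReplicas: 1
    maxReplicas: 4
    # With Prometheus Adapter installed, the generated HPA can target a
    # custom inference metric; per-runtime defaults come from HPAMetricProvider.
    targetMetric: inference_requests_per_second   # hypothetical metric name
    targetValue: "10"
```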

Inference Metrics Dashboard

v0.6.0

Pre-built Grafana dashboard for inference performance. Track latency, throughput, and errors.

  • Request latency, throughput, and error rates
  • Per-model and per-runtime breakdowns
  • Import-ready JSON in config/grafana/

GPU Acceleration

Updated v0.6.0

NVIDIA CUDA 13 support with Blackwell GPUs. Custom layer splits via GPUShardingSpec.

  • CUDA 13 with Blackwell and Qwen3.5 support
  • Custom layer splits from GPUShardingSpec
  • 64 tok/s on Llama 3.2 3B (17x faster than CPU)
  • GKE + NVIDIA GPU Operator ready
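
A custom layer split might be declared like this. GPUShardingSpec is the type named in the docs; the YAML field names below are assumptions for illustration.

```yaml
spec:
  gpuSharding:        # hypothetical field backed by GPUShardingSpec
    gpuCount: 2
    layerSplit: [20, 23]   # e.g. 20 layers on GPU 0, 23 on GPU 1
```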

Metal Agent

Three pluggable runtime backends for Apple Silicon inference. Choose the best fit for your workflow.

  • Ollama backend (200K+ users, auto model download)
  • oMLX backend (MLX-native, 40% faster)
  • llama-server backend (direct llama.cpp control)
  • M1/M2/M3/M4/M5 with health checks and metrics

KV Cache & Advanced Config

Fine-tune llama.cpp behavior through the CRD. Configure KV cache quantization or pass any flag.

  • cacheTypeK / cacheTypeV for KV cache quantization
  • extraArgs escape hatch for any llama-server flag
  • Reduce VRAM usage by up to 4.5x with cache compression
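
A sketch of the KV cache controls. cacheTypeK, cacheTypeV, and extraArgs are named in the docs; their placement in the spec is an assumption, and the two flags shown are real llama-server options used as examples.

```yaml
spec:
  cacheTypeK: q4_0        # quantize the K half of the KV cache
  cacheTypeV: q4_0        # quantize the V half of the KV cache
  extraArgs:              # escape hatch: passed straight to llama-server
    - "--flash-attn"
    - "--cont-batching"
```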

GPU Queue Visibility

See where workloads stand in the GPU queue. Priority classes control scheduling.

  • Real-time queue position in status
  • GPU contention visibility
  • Priority classes for scheduling control

GPU Monitoring

Prometheus and Grafana integration with DCGM GPU metrics and pre-built dashboards.

  • GPU utilization, temp, power monitoring
  • Pre-built Grafana dashboards
  • Prometheus metrics integration

Kubernetes Native

Custom Resource Definitions for Model and InferenceService. Works with kubectl and GitOps.

  • Declarative YAML configuration
  • GitOps-ready deployments
  • Standard K8s patterns
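
A minimal pair of resources, as a hedged sketch: the Model and InferenceService kinds are from the docs, while the API group, version, and field names are assumptions.

```yaml
apiVersion: llmkube.io/v1alpha1   # assumed group/version
kind: Model
metadata:
  name: llama-3-2-3b
spec:
  source:                          # assumed shape; HF repo ID per v0.7.0
    huggingface:
      repoId: meta-llama/Llama-3.2-3B-Instruct
---
apiVersion: llmkube.io/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-2-3b
spec:
  model: llama-3-2-3b              # assumed reference by Model name
```

Because both are plain custom resources, `kubectl apply -f` and any GitOps tool (Argo CD, Flux) handle them like any other manifest.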

Automatic Model Management

Download from HuggingFace, HTTP, or PVC sources. Automatic caching and SHA256 validation.

  • HuggingFace + HTTP + PVC sources
  • GGUF format support
  • Persistent volume caching

OpenAI Compatible

Drop-in replacement for OpenAI API. Use existing tools without code changes.

  • /v1/chat/completions endpoint
  • Streaming responses
  • Compatible with LangChain, etc.
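
Any OpenAI-style client can talk to a deployed service. A small sketch of the request shape: the service hostname, port, and model name below are assumptions specific to your deployment; the /v1/chat/completions path and streaming flag are from the feature list.

```python
import json

# Assumed in-cluster service address for an InferenceService named llama-3-2-3b.
BASE_URL = "http://llama-3-2-3b.default.svc.cluster.local:8080"

def chat_request(prompt: str, stream: bool = False) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": "llama-3.2-3b",   # model name is deployment-specific
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

body = chat_request("Why cache model weights in an init container?")
# POST this JSON to f"{BASE_URL}/v1/chat/completions" with curl, the
# openai SDK (base_url=BASE_URL), or LangChain's ChatOpenAI.
print(json.dumps(body))
```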

CLI Tool

Simple command-line interface for deploying and managing LLM workloads.

  • Deploy with --gpu flag
  • List, status, delete commands
  • macOS, Linux, Windows binaries

CLI Benchmark Suites

Five test suites for comprehensive validation with automated sweeps and markdown reports.

  • llmkube benchmark --suite quick
  • Concurrency, context, and token sweeps
  • Markdown reports for sharing

Grafana Dashboard

Pre-built GPU observability dashboard for monitoring utilization, temperature, and memory.

[Screenshot: LLMKube Grafana dashboard]
  • Multi-GPU monitoring with DCGM
  • Import-ready JSON in config/grafana/

Persistent Model Cache

Download models once, deploy instantly across services. Reduce bandwidth and startup times.

  • Per-namespace model cache PVC
  • Instant model switching
  • Configurable cache invalidation

Model Catalog

20+ pre-configured models. Deploy instantly with optimized settings.

  • One-command deployments
  • Llama, Mistral, Qwen, DeepSeek, Mixtral, Phi
  • Smart defaults with override support

Multi-GPU Support

Deploy 13B-70B+ models across GPUs with automatic layer sharding.

  • ~44 tok/s on Llama 13B with 2x RTX 5060 Ti
  • Automatic tensor split calculation
  • Layer-based sharding

Multi-Cloud Support

Deploy on any Kubernetes distribution. Standard K8s patterns.

  • Tested on GKE, kind, Minikube
  • Works on AKS, EKS, bare metal
  • Custom tolerations and nodeSelector

Helm Chart

Production-ready chart with 50+ configurable parameters.

  • helm install llmkube llmkube/llmkube
  • ImagePullSecrets for private registries
  • Namespace isolation, RBAC included

Validated performance

Real benchmarks from GKE deployments — CPU and GPU

CPU Baseline

TinyLlama 1.1B on GKE n2-standard-2
Token generation
~18.5 tok/s
Prompt processing
~29 tok/s
Response time
~1.5s (P50)
Cold start
~5s

GPU Accelerated

Llama 3.2 3B on GKE + NVIDIA L4
Token generation
~64 tok/s
Prompt processing
~1,026 tok/s
Response time
~0.6s
GPU memory
4.2 GB VRAM
Power usage
~35W

In Development

What's coming next

Multi-node sharding, SLO enforcement, and production hardening

v0.8.0

JSON Benchmark Output

Programmatic output for CI/CD integration and automated tracking.

  • JSON output format
  • CI pipeline integration
  • Diff comparison across runs

Runtime Extensions

Community-contributed runtimes and deeper integration with existing backends.

  • Contributor guide for adding runtimes
  • Runtime-specific health checks
  • Unified metrics across runtimes
Future phases

Multi-Node GPU Sharding

Distribute 70B+ models across multiple GPU nodes with intelligent layer scheduling and P2P KV cache sharing.

SLO Enforcement

Automatic monitoring, enforcement, and intelligent fallback mechanisms with latency-based routing.

Edge Optimization

Distribute inference across edge nodes with geo-aware placement and bandwidth-optimized routing.

Advanced Observability

Per-request cost tracking, quality monitoring, and advanced performance dashboards.

We build incrementally with production validation at each step. Roadmap subject to change based on community feedback.

Ready to get started?

Deploy your first GPU-accelerated LLM in minutes.

LLMKube

Kubernetes for Local LLMs. Deploy, manage, and scale AI inference workloads with production-grade orchestration.

© 2026 Defilan Technologies LLC

Community

Built for the Kubernetes and AI communities

LLMKube is not affiliated with or endorsed by the Cloud Native Computing Foundation or the Kubernetes project. Kubernetes® is a registered trademark of The Linux Foundation.