Skip to content
v0.8.1 mlx-server on Apple Silicon · kubectl scale · Native HPA Autoscaling · Multi-Runtime

Production-Ready LLM Infrastructure

vLLM, TGI, llama.cpp, or bring your own — one operator for every runtime.

How it works

The init container pattern: separate model management from serving

CONTROL PLANEKUBERNETESDATA PLANEWORKLOADAPPLYWATCHMOUNTROUTE
How to read this

Two lanes: control plane above, data plane below. The Model Controller downloads weights once per namespace into a shared cache. The InferenceService Controller creates pods whose init container mounts that cache before the runtime starts. Cold-starting a new replica is a pod boot — no re-download.

Hover or focus a component for detail

Init Container Pattern

Fast cold starts
Models cached in persistent storage
Separation of concerns
Download logic separate from inference
Kubernetes-native
Standard patterns engineers already know
Available Now

What's working today

Production-validated features ready for your deployments

ModelRouter Phase 1

v0.7.8

One OpenAI-compatible endpoint that routes across local InferenceServices and external providers (Anthropic, OpenAI, LiteLLM, Bedrock, Vertex). Declarative rules match on data classification, capability, task complexity, or headers. Fail-closed for regulated data; agent code shrinks to "talk to this URL."

  • Fail-closed apply-time validator blocks PII rules pointing at cloud
  • Per-rule / per-backend / global timeouts (TTFT)
  • Half-open circuit breaker with cloud-tier connection lifecycle
  • Streaming SSE passthrough, no buffering, 32 MiB request cap
  • Structured audit log per request (rule, backend, latency, outcome)

OpenShift First-Class

v0.7.7

Deploy LLMKube on OpenShift, OKD, or MicroShift in one Helm command. The chart's values-openshift.yaml preset disables the operator's default fsGroup so the restricted-v2 SCC can inject its own from the namespace range. No SCC tuning, no oc adm policy dance.

  • Ships charts/llmkube/values-openshift.yaml preset
  • Controller flag --default-fsgroup (0 disables on SCC)
  • MicroShift-backed e2e CI job guards admission compatibility
  • Production-validated for regulated industries (healthcare, defense, finance)
  • Works alongside vanilla K8s on the same chart

mlx-server on Apple Silicon

v0.7.9

An OpenAI-compatible MLX inference server, managed by the metal-agent as a first-class runtime. Run brew install defilantech/tap/mlx-server, set spec.runtime: mlx-server, and the agent handles the process lifecycle, health probes, and memory pre-flight.

  • OpenAI-compatible API with streaming, tool calling, and reasoning split
  • 102.7 tok/s single-stream on an M5 Max (Qwen3.6-35B-A3B 8-bit)
  • 107 ms time-to-first-token
  • Same Prometheus / health-probe surface as the other runtimes
  • Installed from the Homebrew tap with binary auto-discovery

vllm-swift on Apple Silicon

v0.7.7

Native vLLM on Apple Silicon via TheTom's Swift bridge, as a first-class LLMKube runtime. Set spec.runtime: vllm-swift and the metal-agent handles the rest, with TurboQuant KV cache passthrough on the same CRD shape you use for CUDA.

  • vllm-swift binary auto-discovery (--vllm-swift-bin override)
  • TurboQuant quant-scheme + quant-bits passthrough via kvCacheCustomDtype
  • Same Prometheus / health-probe surface as the other runtimes
  • Sample manifest for Qwen3 4B FP8 + TurboQuant
  • Built on the M5 Max long-context benchmarks

Memory-Pressure Protection

Updated v0.7.7

Stop the metal-agent from killing your only inference workload when system memory spikes. Priority-based eviction, a friendly-fire guard for legitimate primary consumers, a per-service opt-out, and (new in 0.7.7) Kubernetes events on every pressure transition that kubectl describe picks up.

  • Watchdog levels with MemoryPressure status condition
  • Eviction-safety floor: never evicts the last managed process
  • evictionProtection: true per InferenceService
  • 50% RSS friendly-fire guard against external pressure spikes
  • Kubernetes events on pressure / eviction / respawn-blocked transitions

TurboQuant KV Cache

v0.7.3

First-class support for fork-specific KV cache types like TurboQuant turbo3, turbo4 on Metal and turbo2 on vLLM. Up to ~6.4× KV cache compression vs f16, which is what unlocks 256K–1M context for agentic coding on a single MacBook.

  • cacheTypeCustomK / cacheTypeCustomV for llama.cpp forks
  • kvCacheCustomDtype for vLLM v0.20+ turbo2
  • Cache-type-aware memory pre-flight check (no false OOM on TurboQuant)
  • Spec-drift respawn: kubectl patch isvc picks up KV cache changes
  • Cross-runtime: same CRD shape on Metal and CUDA

Apple Silicon Power Telemetry

v0.7.2

Live SoC power gauges (CPU + GPU + ANE) sourced from macOS powermetrics. Pairs with InferCost for end-to-end $/MTok cost attribution on M-series Macs, where DCGM doesn't exist.

  • Four Prometheus gauges: combined / GPU / CPU / ANE watts
  • Opt-in via --apple-power-enabled flag
  • One-command make install-powermetrics-sudo with pinned-argv NOPASSWD entry
  • Security audit fixed three findings before merge
  • Agrees with InferCost reading within ~1.6 W under sustained load

Hybrid GPU/CPU Offloading

v0.7.0

Run 30B+ MoE models on consumer GPUs. Expert weights in system RAM, active path on GPU, scheduler-aware host memory requests.

  • moeCPUOffload, moeCPULayers, noKvOffload fields
  • hostMemory request for scheduler placement
  • Tensor overrides and batch-size controls
  • Seven new runtime controls for llama.cpp and vLLM
  • HuggingFace repo ID source for runtime-resolved models

Pluggable Runtime Backends

Updated v0.7.9

Choose the best inference engine for your workload. One CRD, seven runtimes (including the new mlx-server MLX path for Apple Silicon), and FP8-friendly tuning via gpuMemoryUtilization + cpuOffloadGB on vLLM.

  • vLLM with PagedAttention, tensor parallelism, and per-rank CPU offload (community PR)
  • vllm-swift for native Apple Silicon with TurboQuant passthrough
  • mlx-server for OpenAI-compatible MLX inference on Apple Silicon
  • TGI (HuggingFace Text Generation Inference)
  • llama.cpp with GGUF and low-memory efficiency
  • Generic runtime for custom containers

HPA Autoscaling

v0.6.0

Scale inference replicas automatically based on real metrics. Per-runtime metric defaults.

  • Kubernetes HPA with custom inference metrics
  • Configurable min/max replicas and target values
  • Per-runtime default metrics via HPAMetricProvider
  • Works with Prometheus Adapter

Inference Metrics Dashboard

Updated v0.7.7

Pre-built Grafana dashboard plus shipped Prometheus recording rules for p95 latency and TTFT across runtimes. The new inference.llmkube.dev/runtime pod label is promoted onto every series so group-by-runtime queries are free.

  • Recording rules: p95 request latency, p95 TTFT, llama.cpp request latency
  • Per-model, per-runtime, per-namespace breakdowns out of the box
  • Import-ready JSON in docs/grafana/llmkube-inference.json
  • PodMonitor relabelings ship in the Helm chart

GPU Acceleration

Updated v0.6.0

NVIDIA CUDA 13 support with Blackwell GPUs. Custom layer splits via GPUShardingSpec.

  • CUDA 13 with Blackwell and Qwen3.5 support
  • Custom layer splits from GPUShardingSpec
  • 64 tok/s on Llama 3.2 3B (17x faster than CPU)
  • GKE + NVIDIA GPU Operator ready

Metal Agent

Three pluggable runtime backends for Apple Silicon inference. Choose the best fit for your workflow.

  • Ollama backend (200K+ users, auto model download)
  • oMLX backend (MLX-native, 40% faster)
  • llama-server backend (direct llama.cpp control)
  • M1/M2/M3/M4/M5 with health checks and metrics

KV Cache & Advanced Config

Fine-tune llama.cpp and vLLM behavior through the CRD. Standard cache types are enum-validated; fork-specific types pass straight through.

  • cacheTypeK / cacheTypeV (llama.cpp standard enum)
  • kvCacheDtype (vLLM auto / fp8_e5m2 / fp8_e4m3)
  • cacheTypeCustomK/V and kvCacheCustomDtype for fork values
  • extraArgs escape hatch for any runtime flag

GPU Queue Visibility

See where workloads stand in the GPU queue. Priority classes control scheduling.

  • Real-time queue position in status
  • GPU contention visibility
  • Priority classes for scheduling control

GPU Monitoring

Prometheus and Grafana integration with DCGM GPU metrics and pre-built dashboards.

  • GPU utilization, temp, power monitoring
  • Pre-built Grafana dashboards
  • Prometheus metrics integration

Kubernetes Native

Updated v0.7.6

Custom Resource Definitions for Model and InferenceService. Works with kubectl, GitOps, and the rest of your platform stack.

  • Mutable modelRef: swap models without recreating the service
  • runtimeClassName for gVisor, Kata, or custom runtimes
  • podAnnotations + podLabels for service mesh and policy
  • Declarative YAML, GitOps-ready, standard K8s patterns

Automatic Model Management

Download from HuggingFace, HTTP, or PVC sources. Automatic caching and SHA256 validation.

  • HuggingFace + HTTP + PVC sources
  • GGUF format support
  • Persistent volume caching

OpenAI Compatible

Drop-in replacement for OpenAI API. Use existing tools without code changes.

  • /v1/chat/completions endpoint
  • Streaming responses
  • Compatible with LangChain, etc.

CLI Tool

Simple command-line interface for deploying and managing LLM workloads.

  • Deploy with --gpu flag
  • List, status, delete commands
  • macOS, Linux, Windows binaries

CLI Benchmark Suites

Five test suites for comprehensive validation with automated sweeps and markdown reports.

  • llmkube benchmark --suite quick
  • Concurrency, context, and token sweeps
  • Markdown reports for sharing

Grafana Dashboard

Pre-built GPU observability dashboard for monitoring utilization, temperature, and memory.

LLMKube Grafana Dashboard
  • Multi-GPU monitoring with DCGM
  • Import-ready JSON in config/grafana/

Persistent Model Cache

Download models once, deploy instantly across services. Reduce bandwidth and startup times.

  • Per-namespace model cache PVC
  • Instant model switching
  • Configurable cache invalidation

Model Catalog

20+ pre-configured models. Deploy instantly with optimized settings.

  • One-command deployments
  • Llama, Mistral, Qwen, DeepSeek, Mixtral, Phi
  • Smart defaults with override support

Multi-GPU Support

Deploy 13B-70B+ models across GPUs with automatic layer sharding.

  • ~44 tok/s on Llama 13B with 2x RTX 5060 Ti
  • Automatic tensor split calculation
  • Layer-based sharding

Multi-Cloud Support

Deploy on any Kubernetes distribution. Standard K8s patterns.

  • Tested on GKE, kind, Minikube
  • Works on AKS, EKS, bare metal
  • Custom tolerations and nodeSelector

Helm Chart

Production-ready chart with 50+ configurable parameters.

  • helm install llmkube llmkube/llmkube
  • ImagePullSecrets for private registries
  • Namespace isolation, RBAC included

Validated performance

Real benchmarks from GKE deployments — CPU and GPU

CPU Baseline

TinyLlama 1.1B on GKE n2-standard-2
Token generation
~18.5 tok/s
Prompt processing
~29 tok/s
Response time
~1.5s (P50)
Cold start
~5s

GPU Accelerated

Llama 3.2 3B on GKE + NVIDIA L4
Token generation
~64 tok/s
Prompt processing
~1,026 tok/s
Response time
~0.6s
GPU memory
4.2 GB VRAM
Power usage
~35W
In Development

What's coming next

Multi-node sharding, SLO enforcement, and production hardening

v0.8.0

JSON Benchmark Output

Programmatic output for CI/CD integration and automated tracking.

  • JSON output format
  • CI pipeline integration
  • Diff comparison across runs

Runtime Extensions

Community-contributed runtimes and deeper integration with existing backends.

  • Contributor guide for adding runtimes
  • Runtime-specific health checks
  • Unified metrics across runtimes
Future phases

Multi-Node GPU Sharding

Distribute 70B+ models across multiple GPU nodes with intelligent layer scheduling and P2P KV cache sharing.

SLO Enforcement

Automatic monitoring, enforcement, and intelligent fallback mechanisms with latency-based routing.

Edge Optimization

Distribute inference across edge nodes with geo-aware placement and bandwidth-optimized routing.

Advanced Observability

Per-request cost tracking, quality monitoring, and advanced performance dashboards.

We build incrementally with production validation at each step. Roadmap subject to change based on community feedback.

Ready to get started?

Deploy your first GPU-accelerated LLM in minutes.

LLMKube LLMKube

Kubernetes for Local LLMs. Deploy, manage, and scale AI inference workloads with production-grade orchestration.

© 2026 Defilan Technologies LLC

Community

Built for the Kubernetes and AI communities

LLMKube is not affiliated with or endorsed by the Cloud Native Computing Foundation or the Kubernetes project. Kubernetes® is a registered trademark of The Linux Foundation.