
Multi-GPU Support Ships: First Run on ShadowStack

Christopher Maher

LLMKube v0.4.0 is out, and it's a big one: multi-GPU support with automatic layer-based sharding. We tested it on ShadowStack for the first time, and the results exceeded expectations. Here's what we learned.

The Promise Delivered

When we built ShadowStack, we said we'd use it to test real hardware scenarios and publish the results. This is that post.

ShadowStack has dual RTX 5060 Ti GPUs, each with 16 GB of VRAM. That's 32 GB combined, which should be enough to at least take a serious run at 70B models quantized to Q4. But the real question was: would LLMKube's multi-GPU implementation actually work on bare metal, outside the comfortable confines of cloud infrastructure?

Short answer: yes. And it was a smashing success.

The Numbers

We deployed Llama 2 13B quantized to Q4_K_M across both GPUs. Here's what we measured:

Model               Llama 2 13B Q4_K_M
Hardware            2x NVIDIA RTX 5060 Ti 16 GB
Token Generation    ~44 tok/s
GPU Utilization     45-53% on both GPUs
Sharding Mode       Layer-based (--split-mode layer)

44 tokens per second on a 13B model. That's not a lab number or a theoretical projection. That's real inference, on real hardware, running a real Kubernetes workload through LLMKube.

How Multi-GPU Works in LLMKube

The implementation is straightforward. You specify how many GPUs you want, and LLMKube handles the rest:

apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: llama-13b
spec:
  modelRef:
    name: llama-13b-model
  accelerator:
    type: nvidia
    gpuCount: 2  # Split across 2 GPUs

Under the hood, LLMKube calculates the tensor split ratios automatically. For 2 GPUs, it's 50/50. For 4 GPUs, it's 25/25/25/25. The llama.cpp backend handles the actual layer distribution with --split-mode layer.

What makes this different from manually configuring multi-GPU inference? LLMKube abstracts the complexity. You don't need to calculate tensor splits, configure CUDA_VISIBLE_DEVICES, or worry about which layers go where. Define your intent, and the operator figures out the rest.
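
To make that concrete, here's roughly what the generated llama.cpp invocation boils down to for the two-GPU manifest above. This is an illustrative sketch, not the operator's exact command; the model path and layer count are placeholders.

# Illustrative: approximate llama.cpp server flags for the 2-GPU manifest above
llama-server \
  --model /models/llama-2-13b.Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --split-mode layer \
  --tensor-split 0.5,0.5

With gpuCount: 4, the same manifest would resolve to --tensor-split 0.25,0.25,0.25,0.25. The point is that you never write these flags yourself.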

Multi-Cloud, Not Just Multi-GPU

v0.4.0 also brings true multi-cloud support. We removed all the hardcoded cloud-specific logic and added proper tolerations and nodeSelector fields to the API.

This means the same LLMKube deployment works on:

  • Google GKE with L4 or T4 GPUs
  • Azure AKS with spot instances (up to ~80% cost savings)
  • AWS EKS with P-instance or G-instance node groups
  • Bare metal like ShadowStack, running K3s or any vanilla Kubernetes distribution

The cloud examples are included in the repo. Terraform modules for all three major clouds, plus cloud-agnostic YAML for bare metal deployments.
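
For bare metal and mixed clusters, the scheduling knobs look roughly like this. The field placement and label values below are a sketch based on standard Kubernetes scheduling syntax; check the repo examples for the authoritative spec.

apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: llama-13b
spec:
  modelRef:
    name: llama-13b-model
  accelerator:
    type: nvidia
    gpuCount: 2
  # Example scheduling fields; use whatever labels/taints your GPU nodes actually carry
  nodeSelector:
    nvidia.com/gpu.present: "true"
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule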

What This Unlocks

Multi-GPU support opens up model sizes and headroom that a single card can't provide (rough sizing math after the list):

  • 13B models that can now run comfortably on 2x mid-range GPUs
  • 70B models with Q4 quantization on 4x 16GB GPUs
  • Production workloads that need more headroom than a single GPU provides
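
For rough sizing, the weight footprint alone is a useful first approximation. Q4_K_M works out to roughly 4.8 bits per weight, and KV cache plus runtime overhead come on top of these numbers:

# Back-of-envelope weight sizes at Q4_K_M (~4.8 bits/weight), before KV cache and overhead
# 13B:  13e9 * 4.8 / 8  ≈  7.8 GB   -> comfortable on 2x 16 GB
# 70B:  70e9 * 4.8 / 8  ≈  42 GB    -> hence the 4x 16 GB pairing above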

For organizations running LLMs in regulated environments, air-gapped networks, or edge datacenters, this is a game changer. You don't need datacenter-class A100s anymore. A couple of consumer GPUs in a workstation can now run models that previously required enterprise hardware.

Lessons from the First Run

Running on ShadowStack taught us a few things that don't show up in cloud testing:

  • Thermal management matters. Both GPUs sat at 45-53% utilization, which is actually ideal: higher utilization would mean more heat and potential throttling, and the layer-based sharding distributes work evenly (see the monitoring snippet after this list).
  • PCIe bandwidth isn't the bottleneck. We expected inter-GPU communication to be a limiting factor, but llama.cpp's layer sharding keeps most computation local to each GPU.
  • Kubernetes abstractions hold up. The same tolerations and node selectors that work in GKE work on bare metal K3s. No cloud-specific hacks needed.
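
If you want to watch those numbers during your own runs, plain nvidia-smi on the GPU node is enough; nothing here is LLMKube-specific:

# Poll both GPUs every 2 seconds during inference
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu \
  --format=csv -l 2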

What's Next

This is just the beginning of what ShadowStack enables. Over the coming weeks, we'll be testing:

  • 70B models - Can ShadowStack's 32GB of VRAM handle Llama 70B Q4? We'll find out.
  • Multi-model deployments - Running multiple smaller models simultaneously with resource isolation.
  • Failure scenarios - What happens when one GPU fails? How does Kubernetes handle it?
  • Air-gap installation - The full offline workflow with USB transfers and local registries.

All of this will be documented here, with real numbers from real hardware.

Try It Yourself

v0.4.0 is available now:

# Upgrade via Helm
helm repo update llmkube
helm upgrade llmkube llmkube/llmkube --namespace llmkube-system

# Or Kustomize
kubectl apply -k https://github.com/defilantech/LLMKube/config/default?ref=v0.4.0

The multi-GPU deployment guide walks through the complete setup. If you have multi-GPU hardware, we'd love to hear your results.
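
After upgrading, a quick smoke test looks something like this. The manifest filename is whatever you saved the example above as, and the plural CRD name is an assumption (kubectl api-resources will confirm it on your cluster):

# Confirm the controller is running
kubectl get pods -n llmkube-system

# Deploy the 2-GPU example from earlier and watch it come up
kubectl apply -f llama-13b.yaml
kubectl get inferenceservices llama-13b -w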

ShadowStack delivered exactly what we built it for: proof that LLMKube works in the real world, on real hardware, with real constraints. More tests to come.

Follow along: Watch the LLMKube GitHub repository for updates, or check the blog for more ShadowStack benchmarks.