In progress: this page is being written.

Multi-GPU sharding

Run a single model across multiple GPUs on the same node. The Model CR controls the split strategy; the InferenceService CR allocates the GPUs.

What this page will cover

  • Split strategies: layer (default), tensor/row, none, and the pipeline alias (first sketch below).
  • Custom layer ranges via Model.spec.hardware.gpu.sharding.layerSplit (second sketch below).
  • Allocating GPUs on the InferenceService and matching the count in the Model (third sketch below).
  • Verifying that the split actually happened by checking the llama-server startup logs and the per-GPU memory split (final sketch below).
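Until the full walkthrough lands, here is a rough sketch of the Model side. Only the spec.hardware.gpu.sharding.layerSplit path named above is confirmed; the strategy and count field names and the llmkube.dev/v1alpha1 apiVersion are assumptions, so check the CRD reference for the exact spellings.

```yaml
# Rough sketch only: every field name except sharding.layerSplit is an
# assumption until the CRD reference is published.
apiVersion: llmkube.dev/v1alpha1   # assumed group/version
kind: Model
metadata:
  name: llama3-70b
spec:
  # ...model source and runtime fields omitted...
  hardware:
    gpu:
      count: 2            # assumed field; must match the GPUs the InferenceService allocates
      sharding:
        strategy: layer   # layer (default) | tensor | row | none; pipeline is an alias
```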
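Custom layer ranges go through the layerSplit field named above. The value format below (one layer range per GPU) is an assumption for illustration, not a confirmed schema:

```yaml
# Rough sketch only: the layerSplit value format shown here is assumed.
spec:
  hardware:
    gpu:
      count: 2
      sharding:
        strategy: layer
        layerSplit: "0-39,40-79"   # assumed format: one layer range per GPU
```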
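On the serving side, the InferenceService allocates the GPUs. The sketch below uses a plain gpu.count field; the real CR may instead surface standard Kubernetes resource requests such as nvidia.com/gpu, so treat the field names as assumptions. Whatever the spelling, the count must match the Model's:

```yaml
# Rough sketch only: field names are assumed pending the CRD reference.
apiVersion: llmkube.dev/v1alpha1   # assumed group/version
kind: InferenceService
metadata:
  name: llama3-70b
spec:
  modelRef:
    name: llama3-70b   # assumed field: points at the Model above
  gpu:
    count: 2           # must equal the GPU count in the Model spec
```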
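Verification comes down to two checks: the llama-server startup logs should report layers being offloaded across devices, and every allocated GPU should show weight memory in use. The pod label below is hypothetical, and the exact log wording varies by llama.cpp version:

```bash
# Look for layer-offload / split lines in the llama-server startup logs.
kubectl logs -l app=llama3-70b | grep -iE "offload|split"

# A working split shows weight memory on every device, not just GPU 0.
kubectl exec <pod-name> -- nvidia-smi --query-gpu=index,memory.used --format=csv
```

If only one GPU shows memory in use, double-check that the GPU counts in the Model and InferenceService match.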