In progress: this page is being written.
Multi-GPU sharding
Run a single model across multiple GPUs on the same node. The Model CR controls the split strategy; the InferenceService CR allocates the GPUs.
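The division of responsibility between the two CRs can be sketched as below. This is an illustration only: the apiVersion values are placeholders, and the `strategy` field name and the resource-request layout on the InferenceService are assumptions, not confirmed field names; only `Model.spec.hardware.gpu.sharding` is named on this page.

```yaml
# Sketch only. apiVersion is a placeholder; field names other than
# spec.hardware.gpu.sharding are assumptions.
apiVersion: example.com/v1        # placeholder
kind: Model
metadata:
  name: my-model
spec:
  hardware:
    gpu:
      sharding:
        strategy: layer           # assumed field; layer (default), tensor/row, none, pipeline
---
apiVersion: example.com/v1        # placeholder
kind: InferenceService
metadata:
  name: my-model
spec:
  resources:                      # assumed location of the GPU request
    limits:
      nvidia.com/gpu: "2"        # count must match what the Model's split expects
```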
What this page will cover
- Split strategies: layer (default), tensor / row, none, and the pipeline alias.
- Custom layer ranges via Model.spec.hardware.gpu.sharding.layerSplit.
- Allocating GPUs on the InferenceService and matching the count in the Model.
- Verifying the split actually happened by checking llama-server startup logs and the GPU memory split.
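One way to check the memory side of the last point is to compare per-GPU usage reported by `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` inside the pod. The helper below is a hypothetical sketch, not part of the project: with a layer split, each GPU should hold a comparable share of the total used memory, while an unsplit model leaves all weights on one GPU.

```python
# Hypothetical helper: decide whether per-GPU memory usage looks sharded.
# Input is the per-GPU memory.used values (MiB) from:
#   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits

def is_sharded(memory_used_mib, min_share=0.2):
    """Return True if every GPU holds at least `min_share` of the total.

    A layer split spreads weights roughly evenly; with no split, one
    GPU carries (almost) everything and the check fails.
    """
    total = sum(memory_used_mib)
    if total == 0 or len(memory_used_mib) < 2:
        return False
    return all(m / total >= min_share for m in memory_used_mib)

# Example: parse nvidia-smi-style output captured from the pod.
sample = "20480\n19876\n"
values = [int(line) for line in sample.split() if line]
print(is_sharded(values))   # both GPUs hold a comparable share -> True
```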