Thanksgiving 2025: Gratitude, Benchmarks, and Building in the Open
It's the day before Thanksgiving, and as I sit here looking at the latest benchmark results, I find myself reflecting on the journey I've taken with LLMKube. Building in the open isn't always easy, but moments like this remind me why it's worth it.
Gratitude First
Before diving into the numbers (and trust me, they're good), I want to acknowledge what makes this project possible.
To the folks who've starred the repo: thank you! We're small but mighty right now, and each star is someone who saw what we're building and thought it was worth remembering. The issues and discussions are still quiet (it's mostly me talking to myself in there), but that's okay. Every open source project starts somewhere, and I'm building for the long term.
We're building for platform engineers who need to deploy AI in environments where cloud APIs aren't an option. We're building for teams in healthcare, defense, and manufacturing who need inference to stay local. Those people may not have found LLMKube yet, but when they do, I want it to be ready.
To the open source maintainers whose shoulders we stand on: the llama.cpp team, the Kubernetes community, the Go ecosystem. Your work makes mine possible. Open source is a gift economy, and I'm grateful to be part of it.
The Numbers That Made Me Smile
Yesterday, I ran the latest catalog benchmarks on ShadowStack. If you've been following along, you know ShadowStack is my bare-metal testing lab with dual RTX 5060 Ti GPUs. Here's what I measured:
Catalog Benchmark Results (November 26, 2025)
| Model | Generation Speed | P50 Latency |
|---|---|---|
| Llama 3.2 3B | 68.7 tok/s | 1.46s |
| Mistral 7B v0.3 | 65.3 tok/s | 1.15s |
| Llama 3.1 8B | 63.4 tok/s | 1.70s |
| Llama 2 13B (2x GPU) | 44 tok/s | ~2s |
These aren't synthetic benchmarks. This is real inference, through real Kubernetes workloads, on real hardware. The consistency across models is what impressed me most: only an 8% throughput drop going from 3B to 8B parameters. And with P99 latencies within 15-20% of P50, I'm seeing the stability that production workloads demand.
For multi-GPU workloads, I'm still seeing excellent results on larger models:
Multi-GPU Results (Llama 2 13B)
- Generation speed: 44-45 tok/s across 2x RTX 5060 Ti
- Prompt processing: up to 726 tok/s
- GPU utilization: 45-53% on both GPUs, leaving plenty of thermal headroom (see the quick check below)
- Variance: <1% across sequential requests
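If you want to watch both cards while a run is in flight, plain old nvidia-smi does the job. This is a generic monitoring sketch, not an LLMKube feature, and the polling interval is an arbitrary choice.

```bash
# Poll per-GPU utilization, memory, and temperature every 2 seconds
# while a benchmark runs (interval is arbitrary; Ctrl-C to stop).
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,temperature.gpu \
  --format=csv -l 2
```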
What v0.4.7 Brings
These benchmarks were made possible by tools I've added since v0.4.0. The new `llmkube benchmark` command lets you measure your own deployments with a single command. No more scripting curl requests and parsing JSON. Just run the benchmark and get P50, P95, and P99 latencies along with throughput metrics.
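Here's roughly what that looks like in practice. The deployment name below is a made-up placeholder, and I'm assuming you point the command at an existing deployment; check the CLI help for the exact arguments your version supports.

```bash
# Benchmark an existing deployment; "llama-3-2-3b" is a hypothetical
# deployment name (substitute whatever you actually deployed).
llmkube benchmark llama-3-2-3b
# The report includes P50/P95/P99 latency plus generation throughput.
```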
I've also added persistent model caching. Download a model once, and it's instantly available for every subsequent deployment in that namespace. For teams iterating quickly or running multiple services off the same base model, this cuts deployment times dramatically.
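If you want to confirm the cache survived a redeploy, here's one quick check. This sketch assumes the cache sits on a PersistentVolumeClaim and that you deployed into a namespace called llmkube; both are assumptions, so adjust for your setup.

```bash
# List PVCs in the deployment namespace; the model cache volume should
# still be there after tearing down and redeploying a model.
# The "llmkube" namespace is an assumption; use your own.
kubectl get pvc -n llmkube
```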
Building in the Open
There's something vulnerable about publishing every commit, every decision, every benchmark result. I could wait until everything is polished, until the roadmap is complete, until there are enterprise customers to validate the approach.
But that's not how the best infrastructure software gets built. Kubernetes itself was built in the open. So were Docker, Prometheus, and countless other tools that platform engineers rely on every day.
So I publish the benchmarks, the good ones and the ones that show me where to improve. I document the decisions, even when I'm not certain they're right. I ship features before they're perfect, because real feedback beats theoretical planning every time.
Looking Ahead
Tomorrow is Thanksgiving. For those celebrating, I hope you get some time away from terminals and YAML files. Hug someone you love, eat some great food, and remember that the code will still be there next week.
When you come back, the project will still be here. There's auto-scaling to build, 70B model tests to run, and a whole roadmap of features waiting to be implemented. But for today, I'm just grateful. Grateful for the progress of LLMKube, for the community that's forming around this project, and for the chance to build something that might actually help people run AI workloads where they need them most.
Happy Thanksgiving!
Try it yourself: Get started with `helm install llmkube llmkube/llmkube` and run your own benchmarks with `llmkube benchmark`. See the getting started guide for full instructions.
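Copy-paste version of the above, assuming you already have a cluster and Helm configured. The chart repository URL isn't listed here, so the repo add step uses a placeholder; grab the real one from the getting started guide.

```bash
# Add the chart repo (placeholder URL; see the getting started guide),
# then install the operator.
helm repo add llmkube <repo-url>
helm repo update
helm install llmkube llmkube/llmkube

# Once a model is deployed, benchmark it:
llmkube benchmark
```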