
Consumer-hardware model matrix

Snapshot dated 2026-05-11. Not actively maintained.

The local LLM landscape changes by the week. New model releases routinely supersede the ones in this table within a month or two, file sizes shift as quanters re-upload with better imatrix calibration, and license terms occasionally tighten. Treat this document as a starting point, not a source of truth. Before you download anything large, click through to the upstream HuggingFace model card and verify the current size, context length, and license for yourself.

If you spot something out of date, a pull request is welcome. We do not promise to keep this current.

A practical guide to picking a GGUF model for LLMKube based on the hardware you actually have. Every file size in this doc was verified directly from HuggingFace on 2026-05-11. Capability claims come from the upstream model cards (linked throughout), not third-party quanters.

How to read the tables

File size is the raw GGUF size on disk. To actually run the model with a usable context you need additional headroom for the KV cache, runtime overhead, and OS. The rough rule:

required memory = gguf_file_size * 1.20  +  kv_cache(ctx_length)

For most chat workloads at 8K to 16K context, add 20 to 30 percent to the file size and you will be close. Long context windows (128K+) or fp16 KV cache push this much higher. The KV cache headroom section below covers the math, and the LLMKube KV cache types and Memory-pressure protection docs document the runtime knobs.
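The rule of thumb above can be written as a quick script check. A minimal sketch (the function name and the 20 percent overhead default are illustrative, not an LLMKube API):

```python
def required_memory_gb(gguf_size_gb: float, kv_cache_gb: float,
                       overhead: float = 0.20) -> float:
    """Back-of-envelope serving footprint: weights plus 20 to 30 percent
    runtime overhead, plus the KV cache for your target context length."""
    return gguf_size_gb * (1.0 + overhead) + kv_cache_gb

# gpt-oss-20b at Q4_K_M (10.83 GB) with ~2 GB of KV at 16K context:
print(round(required_memory_gb(10.83, 2.0), 1))  # -> 15.0, tight on a 16 GB card
```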

Quant choice. Q4_K_M and IQ4_XS are the sweet spot for most users: good quality, smallest size that still feels like the original. Q5_K_M and Q6_K cost more memory but get closer to fp16 quality. Q8_0 is near lossless at roughly 1 byte per parameter. The Unsloth UD-*_XL variants are dynamic quants that mix bit widths smartly; usually preferred when available.

MoE models (Qwen3 Coder, Qwen3.5/3.6 A3B, gpt-oss-20b, gpt-oss-120b) are bigger on disk than their throughput suggests, because only a small subset of experts is active per token. They still need the full file resident in memory, but they generate as fast as a much smaller dense model. A 35B MoE with 3B active params loads like a 22 GB model at Q4 and generates like a 3B dense.

Hardware tiers

| Tier | Usable memory | Example hardware |
|---|---|---|
| 1: Edge | up to 8 GB | RTX 3060 8GB, RTX 4060 8GB, MacBook Air M2/M3 8GB, Jetson Orin Nano, Steam Deck, integrated graphics |
| 2: Entry | 12 to 16 GB | RTX 3060 12GB, RTX 4060 Ti 16GB, RTX 4070 12GB, MacBook Pro M2/M3 Pro 16/18GB |
| 3: Enthusiast | 24 to 32 GB | RTX 3090 24GB, RTX 4090 24GB, RTX 5090 32GB, MacBook Pro M2/M3 Max 32GB, Mac Mini M4 24GB |
| 4: Pro single-node | 48 to 64 GB | RTX 6000 Ada 48GB, 2x RTX 3090, MacBook Pro M3/M4 Max 48/64GB, Mac Studio M2 Max 64GB |
| 5: Workstation | 96 to 128 GB | MacBook Pro M3/M4 Max 96/128GB, Mac Studio M2 Ultra, 2x RTX 4090/5090, RTX 6000 Pro 96GB |
| 6: Multi-GPU / Ultra | 192 GB and up | Mac Studio M3/M4 Ultra 192/256/512GB, 4x RTX 5090, multi-RTX 6000 Pro, DGX-class |

In LLMKube specifically, multi-GPU sharding lets you span tiers by adding GPUs. A 70B at Q4_K_M (about 40 GB) does not fit on a single 24 GB RTX 4090, but it shards cleanly across 2x RTX 4090 with gpuCount: 2. See Multi-GPU sharding.

Mac unified memory: usable vs advertised

Apple Silicon shares one memory pool between CPU and GPU. That is an advantage (no copy, no VRAM split), but it means macOS and background processes eat into the same pool you want to load a model into.

| Advertised RAM | Usable for the model + KV | Realistic Q4_K_M ceiling |
|---|---|---|
| 8 GB | ~5 to 6 GB | 3B to 4B models |
| 16 GB | ~12 to 13 GB | up to 12B to 14B |
| 24 GB | ~20 to 21 GB | up to 20B to 27B |
| 32 GB | ~28 to 29 GB | up to a 35B MoE |
| 48 GB | ~44 to 45 GB | 35B at Q8, some 70B at extreme quants |
| 64 GB | ~60 GB | 70B at Q4_K_M (tight), 35B at Q8 |
| 96 GB+ | ~90 GB and up | gpt-oss-120b MXFP4, 70B at Q6, room for 128K context |
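The usable-memory column follows a rough pattern you can script. This is a sketch that just encodes the ballparks from the table above; the reserve sizes are this doc's estimates, not anything macOS guarantees:

```python
def mac_usable_gb(advertised_gb: int) -> float:
    """Unified memory realistically available for model weights + KV cache
    after macOS and background processes take their share (rough estimate)."""
    if advertised_gb <= 8:
        reserve = 2.5                      # small machines lose proportionally more
    elif advertised_gb <= 32:
        reserve = 3.5
    else:
        reserve = advertised_gb * 0.07     # roughly 7 percent on big machines
    return advertised_gb - reserve

print(mac_usable_gb(16))  # -> 12.5, in line with the "~12 to 13 GB" row
```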

Inference speed on Apple Silicon is limited by memory bandwidth, not core count:

| Chip | Memory bandwidth | Notes |
|---|---|---|
| M3 / M4 base | ~100 to 120 GB/s | Fine for 4B to 7B at Q4 |
| M3 / M4 Pro | ~270 to 300 GB/s | Comfortable through 13B to 14B |
| M3 / M4 Max | ~400 to 540 GB/s | The sweet spot for 30B MoE workloads |
| M3 / M4 Ultra | ~800 GB/s+ | Frontier-class single-node hardware |

Rough token/sec ballparks at Q4_K_M (interactive use):

| Model size | M2/M3 base | M2/M3 Pro | M3/M4 Max |
|---|---|---|---|
| 4B dense | 25 to 35 | 35 to 50 | 50 to 80 |
| 8B dense | 15 to 25 | 25 to 40 | 40 to 60 |
| 14B dense | 8 to 15 | 12 to 20 | 20 to 35 |
| 30B MoE / 3B active | 15 to 25 | 25 to 35 | 30 to 50 |
| 70B dense | does not fit | does not fit | 5 to 10 |

These are ballparks. Real numbers depend on context length, llama.cpp build flags, and other apps competing for memory bandwidth. Treat them as an answer to “is this interactive?”, not as benchmarks.
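These ballparks have a physical ceiling you can sanity-check: single-stream decode must stream every active weight from memory once per token, so bandwidth divided by active-weight bytes is a hard upper bound. A sketch (the function name is illustrative; bandwidth figures come from the chip table above):

```python
def decode_tps_ceiling(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Roofline upper bound on single-stream decode speed: every active
    weight is read once per generated token. For MoE models, use the
    active-expert footprint, not the full file size."""
    return bandwidth_gb_s / active_weights_gb

# 8B dense at Q4_K_M (4.58 GB) on an M4 Max (~500 GB/s):
print(round(decode_tps_ceiling(500, 4.58)))  # -> 109; the table's 40 to 60 sits below it
```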

KV cache headroom

The KV cache grows roughly linearly with context length. As a back-of-envelope at Q4_K_M weights:

| Model class | 8K context | 32K context | 128K context |
|---|---|---|---|
| 7B to 8B dense | ~0.5 to 1 GB | ~2 to 4 GB | ~8 to 16 GB |
| 13B to 14B dense | ~1 to 2 GB | ~4 to 8 GB | ~16 to 24 GB |
| 30B MoE | ~1 to 2 GB | ~4 to 8 GB | ~16 to 24 GB |
| 70B dense | ~2 to 4 GB | ~8 to 16 GB | does not fit on consumer hw |

Practical examples:

  • A 17 GB model on a 24 GB GPU leaves about 7 GB for KV cache. Fine for 8K to 16K context. Not enough for 128K.
  • gpt-oss-20b (10.8 GB) on a 16 GB card leaves about 5 GB. Comfortable at 16K. Tight at 32K.
  • Llama 3.3 70B Q4_K_M (39.6 GB) sharded across 2x RTX 4090 (48 GB total) leaves about 8 GB for KV. Stay under 32K context unless you switch to a Q4 KV cache type.

LLMKube exposes the KV cache dtype as a Model CRD field; see KV cache types for the trade-off (Q4 cache halves memory at a small quality cost).
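For dense transformers the KV cache size is exact arithmetic, not a guess: two tensors (K and V) per layer, each ctx_len × n_kv_heads × head_dim. A sketch, using Llama 3.1 8B's published architecture (32 layers, 8 KV heads, head dim 128) as the worked example; the fp16 default corresponds to an unquantized cache:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: float = 2.0) -> float:
    """Exact KV cache size: K and V tensors per layer, fp16 by default.
    Quantized cache types (see KV cache types) shrink bytes_per_elem."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

# Llama 3.1 8B at 32K context with an fp16 cache:
print(kv_cache_gb(32, 8, 128, 32768))  # -> 4.0, the top of the table's ~2 to 4 GB row
```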

Multi-GPU notes

llama.cpp supports multi-GPU inference via layer splitting. LLMKube wraps that with gpuCount on the Model CRD.

  • Two 12 GB cards can run a 20B to 24B model (10 to 12 GB of weights per GPU plus KV).
  • Two 24 GB cards comfortably hold a 70B at Q4_K_M (about 40 GB of weights split across two devices).
  • PCIe communication adds latency. For a given total VRAM budget, one bigger card beats two smaller cards in tokens/sec. Pick multi-GPU when you need the model to fit, not for speed.
  • Sharding shape matters. Layer-split (the default in llama.cpp and LLMKube) is simpler than tensor-parallel and works well for inference. Tensor-parallel needs NVLink-class interconnect to actually help.
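The fit arithmetic in these bullets is plain subtraction, mirroring the back-of-envelope used throughout this doc (runtime overhead eats into the same headroom, so treat the result as optimistic):

```python
def kv_headroom_gb(gguf_gb: float, gpu_vram_gb: list[float]) -> float:
    """VRAM left over for KV cache and runtime overhead once the weights
    are layer-split across the listed GPUs."""
    return sum(gpu_vram_gb) - gguf_gb

# Llama 3.3 70B Q4_K_M across 2x RTX 4090 (gpuCount: 2 in LLMKube):
print(round(kv_headroom_gb(39.61, [24.0, 24.0]), 1))  # -> 8.4 GB left for KV
```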

The matrix, by hardware tier

Tier 1: Edge devices (up to 8 GB)

| Model | Best for | Recommended quant | File size | License | Context | Notes |
|---|---|---|---|---|---|---|
| Llama 3.2 1B Instruct | Classification, summarization, simple Q&A, autocomplete | Q8_0 | 1.23 GB | Llama 3.2 | 128K | Multilingual (8 languages), good for on-device. |
| Llama 3.2 3B Instruct | Lightweight chat, RAG retrieval, function-call routing | Q4_K_M | 1.88 GB | Llama 3.2 | 128K | The smallest model that still feels like an assistant. Q6_K (2.46 GB) if you have room. |
| Gemma 3 4B IT | Vision + text on edge, multilingual chat | Q4_K_M | 2.32 GB | Gemma | 128K | Multimodal (vision); 140+ languages. Add ~0.85 GB for the mmproj file. |
| Qwen3.5 4B | Long-context summarization, vision, edge agents | UD-Q4_K_XL | 2.91 GB | Apache 2.0 | 256K native, 1M with YaRN | Vision-language, thinking mode, hybrid MoE + DeltaNet. Add mmproj ~0.67 GB for vision. |
| Phi-4-mini Instruct | On-device math, function calling, structured output | Q4_K_M | ~2.5 GB | MIT | 128K | Strongest small model for tool use. |
| Granite 4.1 3B | Enterprise RAG / tool use on tight memory | Q4_K_M | ~2 GB | Apache 2.0 | 128K | Mamba-Transformer hybrid; cheap KV cache at long context. |

Tier 2: Entry GPUs / laptops (12 to 16 GB)

| Model | Best for | Recommended quant | File size | License | Context | Notes |
|---|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | General-purpose chat baseline, RAG, instruction following | Q4_K_M | 4.58 GB | Llama 3.1 | 128K | The default workhorse if in doubt. Q5_K_M (5.34 GB) for higher quality. |
| Granite 4.1 8B | Enterprise tool use, RAG, function calling, fill-in-the-middle code | Q4_K_M | 4.98 GB | Apache 2.0 | 128K | BFCL v3 of 68.27, HumanEval 85.4. Strong for agentic backends. |
| Qwen3.5 9B | Vision + reasoning + long context in one package | Q4_K_M | 5.29 GB | Apache 2.0 | 256K native, 1M with YaRN | Thinking mode by default; vision via mmproj; 201 languages. |
| Gemma 3 12B IT | Vision-language tasks, multilingual chat, document QA | Q4_K_M | 7.30 GB | Gemma | 128K | 140+ languages. Add ~0.85 GB for mmproj. |
| DeepSeek R1 Distill Qwen 14B | Step-by-step reasoning, math, multi-step problems | Q4_K_M | 8.99 GB | MIT | 128K | Produces visible thinking traces. Q5_K_M (10.51 GB) fits 16 GB. |
| Phi-4 14B | Math, code, STEM reasoning on a tight memory budget | Q4_K_M | 8.43 GB | MIT | 16K | MATH 80.4, HumanEval 82.6, GPQA 56.1. Shorter context than peers. |
| gpt-oss-20b | Agentic reasoning, function calling, Python execution | Q4_K_M | 10.83 GB | Apache 2.0 | 128K | MoE: 20B total / 3.6B active. SWE-bench 60.7 at high reasoning. Note: Q8_0 is only 11.27 GB (model is natively MXFP4), so Q8 also fits 16 GB cards. |

Tier 3: Enthusiast (24 to 32 GB)

| Model | Best for | Recommended quant | File size | License | Context | Notes |
|---|---|---|---|---|---|---|
| Mistral Small 3.2 24B | Vision + tools + chat, function calling, instruction following | Q4_K_M | 14.33 GB | Apache 2.0 | 128K | HumanEval+ 92.9, MBPP+ 78.3. Vision via mmproj. |
| Devstral Small 2 24B | Software-engineering agents, multi-file edits, code exploration | Q4_K_M | 14.33 GB | Apache 2.0 | 256K | SWE-Bench Verified 68.0. Designed for Cline / agent IDEs. |
| Qwen3.5 27B | Dense multimodal model, vision + reasoning | Q4_K_M | 15.59 GB | Apache 2.0 | 256K | Vision-language; thinking mode; 201 languages. |
| Qwen3 Coder 30B A3B | Agentic coding, repo-scale understanding, browser-use | Q4_K_M | 17.29 GB | Apache 2.0 | 256K native, 1M with YaRN | MoE: 30.5B / 3.3B active. Non-thinking, function-call native. Compatible with Qwen Code, CLINE. |
| DeepSeek R1 Distill Qwen 32B | Heavy reasoning, math, long chain-of-thought | Q4_K_M | 18.48 GB | MIT | 128K | Visible reasoning traces. Q5_K_M (21.65 GB) fits 32 GB cards. |
| Qwen3.6 35B A3B | Best balance of speed + quality + vision in this tier | UD-Q4_K_XL | 20.83 GB | Apache 2.0 | 256K | MoE: 35B / 3B active. SWE-bench Verified 73.4, AIME 2026 92.7, MMLU-Pro 85.2. Strong vision. |
| Llama 3.3 70B Instruct | Production chat / multilingual at 70B quality | UD-IQ2_M | 22.62 GB | Llama 3.3 | 128K | Aggressive 2-bit dynamic quant. Quality dips vs Q4 but lands on a 24/32 GB device. Prefer Tier 4 if you have it. |

Tier 4: Pro single-node (48 to 64 GB)

| Model | Best for | Recommended quant | File size | License | Context | Notes |
|---|---|---|---|---|---|---|
| Llama 3.3 70B Instruct | Frontier-class general chat, multilingual, tool use | Q4_K_M | 39.61 GB | Llama 3.3 | 128K | The 70B everyone benchmarks against. Shards across 2x 24 GB GPUs. Q5_K_M (46.53 GB) on 48 GB cards / 64 GB Macs. |
| DeepSeek R1 Distill Llama 70B | Heavy reasoning at 70B scale | Q4_K_M | ~42 GB | Llama 3.3 | 128K | Same shape as Llama 3.3 70B with R1-style thinking traces. |
| Qwen3.6 35B A3B | High-quality multimodal MoE with headroom for context | Q6_K | 27.30 GB | Apache 2.0 | 256K | Comfortable here; can run 128K+ context without IQ tricks. |
| Mistral Large 2411 (123B) | Multilingual frontier-class on Apple Ultra or multi-GPU | UD-IQ2_M | ~40 GB | Mistral Research / Commercial | 128K | Tight at 48 GB; a 64 GB Mac is the natural home with extreme quants. Note this license is not Apache 2.0. |

Tier 5: Workstation (96 to 128 GB)

| Model | Best for | Recommended quant | File size | License | Context | Notes |
|---|---|---|---|---|---|---|
| gpt-oss-120b | Frontier-class reasoning, agentic, 80GB+ single-host | MXFP4_MOE | 59.02 GB (sharded, 2 files) | Apache 2.0 | 128K | MoE: 117B / 5.1B active. Built to run on a single 80 GB H100. Fits on one 96 GB Mac or shards across 2x 48 GB GPUs. SWE-bench 62.4, GPQA Diamond 80.9 at high reasoning. |
| Llama 3.3 70B Instruct | Highest-quality 70B you can run locally | Q6_K | ~58 GB (community) | Llama 3.3 | 128K | Q8_0 (~70 GB) comfortably fits with KV cache headroom. |
| Devstral 2 123B (larger sibling) | Multi-file agentic SWE at scale | UD-IQ4_XS | ~68 GB | Apache 2.0 | 256K | For serious code agents. Verify the specific sibling repo for your machine. |

Tier 6: Multi-GPU / Ultra (192 GB and up)

| Model | Best for | Recommended quant | File size | License | Context | Notes |
|---|---|---|---|---|---|---|
| gpt-oss-120b | Max-quality reasoning with long context | Q8_0 | 60.79 GB (sharded) | Apache 2.0 | 128K | Near-fp16 quality in this tier; leaves room for a 128K KV cache. |
| DeepSeek V4 Flash (very large MoE) | Frontier coding / reasoning, long context | community IQ2/IQ4 | 200 GB+ | MIT | 128K+ | Designed for Apple Silicon Metal with extreme quants per community quanters. Verify the specific quant against your machine. |
| Multiple Llama 3.3 70B replicas | High-throughput serving | Q4_K_M | 39.61 GB each | Llama 3.3 | 128K | Use LLMKube replicas: N to scale horizontally across GPUs / nodes. |

The matrix, by task

Same models, sorted by what you want to do. Numbers cited come from the upstream model cards.

General assistant / chat / RAG

| Tier | First pick | Backup |
|---|---|---|
| 1 | Llama 3.2 3B (1.88 GB) | Gemma 3 4B (2.32 GB) |
| 2 | Llama 3.1 8B (4.58 GB) | Qwen3.5 9B (5.29 GB) |
| 3 | Mistral Small 3.2 24B (14.33 GB) | Qwen3.5 27B (15.59 GB) |
| 4 | Llama 3.3 70B Q4_K_M (39.61 GB) | Qwen3.6 35B A3B Q6_K (27.30 GB) |
| 5+ | gpt-oss-120b MXFP4 (59 GB) | Llama 3.3 70B Q8_0 |

Coding / agentic SWE

| Tier | First pick | Notes |
|---|---|---|
| 1 | Qwen3.5 4B (2.74 GB) | Smallest model with usable code skills. |
| 2 | Granite 4.1 8B Q4_K_M (4.98 GB) | HumanEval 85.4, MBPP 87.3, native FIM. |
| 2 | gpt-oss-20b Q4_K_M (10.83 GB) | SWE-bench 60.7 at high reasoning. |
| 3 | Qwen3 Coder 30B A3B (17.29 GB) | Best open agent-IDE model in this size class; 256K context. |
| 3 | Devstral Small 2 24B (14.33 GB) | SWE-Bench Verified 68.0; built for multi-file edits. |
| 4 | Qwen3 Coder 30B A3B Q6_K (~23 GB) | Same model, more quality headroom. |
| 5+ | gpt-oss-120b | SWE-bench 62.4, broader knowledge. |

Reasoning / math / STEM

| Tier | First pick | Notes |
|---|---|---|
| 1 | Phi-4-mini reasoning (~2.5 GB) | Tiny MIT-licensed reasoner. |
| 2 | Phi-4 14B Q4_K_M (8.43 GB) | MATH 80.4, GPQA 56.1, MIT. |
| 2 | DeepSeek R1 Distill Qwen 14B (8.99 GB) | Visible chain-of-thought. |
| 3 | DeepSeek R1 Distill Qwen 32B (18.48 GB) | Deeper reasoning, longer traces. |
| 3 | Qwen3.6 35B A3B (20.83 GB) | AIME 2026 92.7, MMLU-Pro 85.2 with thinking on. |
| 5+ | gpt-oss-120b (high reasoning) | GPQA Diamond up to 80.9. |

Vision / multimodal

llama.cpp loads the vision projector from a separate mmproj-*.gguf file alongside the language GGUF. Account for both.

| Tier | First pick | mmproj size |
|---|---|---|
| 1 | Gemma 3 4B (2.32 GB) | ~0.85 GB |
| 1 | Qwen3.5 4B (2.74 GB) | ~0.67 GB |
| 2 | Gemma 3 12B (7.30 GB) | ~0.85 GB |
| 2 | Qwen3.5 9B (5.29 GB) | ~0.85 GB |
| 3 | Mistral Small 3.2 24B (14.33 GB) | ~0.88 GB |
| 3 | Qwen3.5 27B (15.59 GB) | ~0.86 GB |
| 3 | Qwen3.6 35B A3B (20.83 GB) | ~0.84 GB |

Function calling / tool use / agents

| Tier | First pick | Notes |
|---|---|---|
| 1 | Phi-4-mini Instruct | Strong tool use at edge sizes. |
| 2 | Granite 4.1 8B | BFCL v3 of 68.27, dedicated tool-call template. |
| 2 | gpt-oss-20b | Native function calling, web/python tools. |
| 3 | Qwen3 Coder 30B A3B | Native function-call format, designed for agent loops. |
| 3 | Mistral Small 3.2 24B | Mistral function-call template, vLLM tool parser. |
| 5+ | gpt-oss-120b | Production-grade tool use at frontier quality. |

Long context (100K+ tokens)

| Tier | First pick | Native context |
|---|---|---|
| 1 | Qwen3.5 4B | 256K (1M with YaRN) |
| 2 | Qwen3.5 9B | 256K (1M with YaRN) |
| 2 | Granite 4.1 8B | 128K, cheap KV cache (Mamba hybrid) |
| 3 | Qwen3 Coder 30B A3B | 256K native, 1M with YaRN |
| 3 | Qwen3.6 35B A3B | 256K native, 1M with YaRN |
| 4+ | Devstral Small 2 24B | 256K |

Phi-4 14B is only 16K context; do not pick it for long-document workloads.

What this doc deliberately does not include

  • Uncensored / abliterated / “heretic” variants. They show up high in HuggingFace trending, but we want this list to be reproducible.
  • Hobbyist frankenmerges (anything labeled “X-Distilled-GGUF” from non-vendor authors). The base models above are the inputs to most of those; pick the base unless you have a specific reason.
  • Pure base models (no -Instruct / -it suffix). LLMKube users almost always want instruct variants.
  • Embedding / reranker GGUFs. Those belong in a separate matrix for RAG infrastructure.
  • GGUFs marked imatrix-only (calibration files, not runnable models).
  • CPU-only inference benchmarks. Possible but typically 1 to 5 tok/s for a 7B; this doc assumes GPU or Apple Silicon.

Methodology and sources

Every file size in this doc was fetched from the HuggingFace tree API on 2026-05-11. Capability claims (context length, license, benchmarks) come from the upstream meta-llama/, Qwen/, mistralai/, google/, microsoft/, openai/, deepseek-ai/, and ibm-granite/ model cards, not from third-party quanters. The Unsloth and Bartowski GGUF repos are referenced because they are the most-downloaded community quants with reproducible naming, but the underlying weights and licenses follow the upstream publisher.

When in doubt, prefer Q4_K_M as the starting quant, measure quality on your own task, and step up to Q5_K_M or Q6_K only if you see issues. Going below Q4 (IQ3, Q2, IQ2, IQ1) trades real quality for memory; only do it if the model otherwise will not fit.

A reminder that this will go stale

The local LLM ecosystem is one of the fastest-moving spaces in software. Between when this was written and when you are reading it, expect that:

  • New model generations (Llama 4.x, Qwen3.7, Gemma 5, etc.) have probably shifted the recommendations.
  • Quanters have re-uploaded files with better imatrix calibration; sizes shift by a few percent.
  • Tools like llama.cpp and LLMKube have added new quant types, faster KV cache modes, or runtime backends.
  • License terms occasionally change (especially Llama and Gemma).

Use this as a structured starting point. Always click through to the upstream HuggingFace model card before downloading a large file. If you find this doc is meaningfully wrong, a PR with the corrected row is appreciated, but please do not assume any single number here is current the day you read it.

LLMKube — Kubernetes for Local LLMs. Deploy, manage, and scale AI inference workloads with production-grade orchestration.

© 2026 Defilan Technologies LLC