Consumer-hardware model matrix
Snapshot dated 2026-05-11. Not actively maintained.
The local LLM landscape changes by the week. New model releases routinely supersede the ones in this table within a month or two, file sizes shift as quanters re-upload with better imatrix calibration, and license terms occasionally tighten. Treat this document as a starting point, not a source of truth. Before you download anything large, click through to the upstream HuggingFace model card and verify the current size, context length, and license for yourself.
If you spot something out of date, a pull request is welcome. We do not promise to keep this current.
A practical guide to picking a GGUF model for LLMKube based on the hardware you actually have. Every file size in this doc was verified directly from HuggingFace on 2026-05-11. Capability claims come from the upstream model cards (linked throughout), not third-party quanters.
How to read the tables
File size is the raw GGUF size on disk. To actually run the model with a usable context you need additional headroom for the KV cache, runtime overhead, and OS. The rough rule:
required memory = gguf_file_size * 1.20 + kv_cache(ctx_length)

For most chat workloads at 8K to 16K context, add 20 to 30 percent to the file size and you will be close. Long context windows (128K+) or fp16 KV cache push this much higher. The KV cache headroom section below covers the math, and the LLMKube KV cache types and Memory-pressure protection docs document the runtime knobs.
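If you prefer to script the check, here is a minimal sketch of the same rule; the function name and the 1.20 overhead factor are just this doc's rule of thumb, not an LLMKube or llama.cpp API.

```python
def estimate_required_gb(gguf_file_gb: float, kv_cache_gb: float, overhead: float = 1.20) -> float:
    """Back-of-envelope memory need: weights plus ~20% runtime overhead plus KV cache.

    The 1.20 factor is the rough rule of thumb from the prose above, not a measured constant.
    """
    return gguf_file_gb * overhead + kv_cache_gb

# Example: Llama 3.1 8B Q4_K_M (4.58 GB file) with ~1 GB of KV at 8K context
# lands around 6.5 GB, which is why it is a Tier 2 recommendation rather than Tier 1.
print(round(estimate_required_gb(4.58, 1.0), 1))  # 6.5
```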
Quant choice. Q4_K_M and IQ4_XS are the sweet spot for most users: good quality, smallest size that still feels like the original. Q5_K_M and Q6_K cost more memory but get closer to fp16 quality. Q8_0 is near lossless at roughly 1 byte per parameter. The Unsloth UD-*_XL variants are dynamic quants that mix bit widths smartly; usually preferred when available.
MoE models (Qwen3 Coder, Qwen3.5/3.6 A3B, gpt-oss-20b, gpt-oss-120b) are bigger on disk than their throughput suggests, because only a small subset of experts is active per token. They still need the full file resident in memory, but they generate as fast as a much smaller dense model. A 35B MoE with 3B active params loads like a 22 GB model at Q4 and generates like a 3B dense.
Hardware tiers
| Tier | Usable memory | Example hardware |
|---|---|---|
| 1: Edge | up to 8 GB | RTX 3060 8GB, RTX 4060 8GB, MacBook Air M2/M3 8GB, Jetson Orin Nano, Steam Deck, integrated graphics |
| 2: Entry | 12 to 16 GB | RTX 3060 12GB, RTX 4060 Ti 16GB, RTX 4070 12GB, MacBook Pro M2/M3 Pro 16/18GB |
| 3: Enthusiast | 24 to 32 GB | RTX 3090 24GB, RTX 4090 24GB, RTX 5090 32GB, MacBook Pro M2/M3 Max 32GB, Mac Mini M4 24GB |
| 4: Pro single-node | 48 to 64 GB | RTX 6000 Ada 48GB, 2x RTX 3090, MacBook Pro M3/M4 Max 48/64GB, Mac Studio M2 Max 64GB |
| 5: Workstation | 96 to 128 GB | MacBook Pro M3/M4 Max 96/128GB, Mac Studio M2 Ultra, 2x RTX 4090/5090, RTX 6000 Pro 96GB |
| 6: Multi-GPU / Ultra | 192 GB and up | Mac Studio M3/M4 Ultra 192/256/512GB, 4x RTX 5090, multi-RTX 6000 Pro, DGX-class |
In LLMKube specifically, multi-GPU sharding lets you span tiers by adding GPUs. A 70B at Q4_K_M (about 40 GB) does not fit a single 24 GB RTX 4090, but it shards cleanly across 2x RTX 4090 with gpuCount: 2. See Multi-GPU sharding.
Mac unified memory: usable vs advertised
Apple Silicon shares one memory pool between the CPU and GPU. That is an advantage (no copying, no VRAM split), but it means macOS and background processes eat into the same pool you want to load a model into.
| Advertised RAM | Usable for the model + KV | Realistic Q4_K_M ceiling |
|---|---|---|
| 8 GB | ~5 to 6 GB | 3B to 4B models |
| 16 GB | ~12 to 13 GB | up to 12B to 14B |
| 24 GB | ~20 to 21 GB | up to 20B to 27B |
| 32 GB | ~28 to 29 GB | up to a 35B MoE |
| 48 GB | ~44 to 45 GB | 35B at Q8, some 70B at extreme quants |
| 64 GB | ~60 GB | 70B at Q4_K_M (tight), 35B at Q8 |
| 96 GB+ | ~90 GB and up | gpt-oss-120b MXFP4, 70B at Q6, room for 128K context |
Inference speed on Apple Silicon is limited by memory bandwidth, not core count:
| Chip | Memory bandwidth | Notes |
|---|---|---|
| M3 / M4 base | ~100 to 120 GB/s | Fine for 4B to 7B at Q4 |
| M3 / M4 Pro | ~270 to 300 GB/s | Comfortable through 13B to 14B |
| M3 / M4 Max | ~400 to 540 GB/s | The sweet spot for 30B MoE workloads |
| M3 / M4 Ultra | ~800 GB/s+ | Frontier-class single-node hardware |
Rough token/sec ballparks at Q4_K_M (interactive use):
| Model size | M2/M3 base | M2/M3 Pro | M3/M4 Max |
|---|---|---|---|
| 4B dense | 25 to 35 | 35 to 50 | 50 to 80 |
| 8B dense | 15 to 25 | 25 to 40 | 40 to 60 |
| 14B dense | 8 to 15 | 12 to 20 | 20 to 35 |
| 30B MoE / 3B active | 15 to 25 | 25 to 35 | 30 to 50 |
| 70B dense | does not fit | does not fit | 5 to 10 |
These are ballparks. Real numbers depend on context length, llama.cpp build flags, and other apps competing for memory bandwidth. Treat them as an answer to “is this interactive?”, not as benchmarks.
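One way to sanity-check these ranges yourself: single-stream decoding is roughly bandwidth-bound, so a crude upper bound on tokens/sec is memory bandwidth divided by the bytes read per generated token (the whole file for a dense model, roughly the active-expert slice for an MoE). A minimal sketch of that arithmetic; the byte counts in the comments are back-of-envelope, and real throughput lands below this ceiling.

```python
def decode_ceiling_toks_per_sec(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Rough upper bound: every generated token streams the active weights once."""
    return bandwidth_gb_s / gb_read_per_token

# Dense 8B at Q4_K_M (~4.6 GB read per token) on an M3 base (~100 GB/s):
print(round(decode_ceiling_toks_per_sec(100, 4.6)))  # ~22 tok/s ceiling

# 30B MoE with ~3B active params reads only the active slice (~2 GB at Q4) per token,
# which is why it decodes like a small dense model despite a 17+ GB file on disk.
print(round(decode_ceiling_toks_per_sec(100, 2.0)))  # ~50 tok/s ceiling
```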
KV cache headroom
The KV cache grows roughly linearly with context length. As a back-of-the-envelope estimate with Q4_K_M weights:
| Model class | 8K context | 32K context | 128K context |
|---|---|---|---|
| 7B to 8B dense | ~0.5 to 1 GB | ~2 to 4 GB | ~8 to 16 GB |
| 13B to 14B dense | ~1 to 2 GB | ~4 to 8 GB | ~16 to 24 GB |
| 30B MoE | ~1 to 2 GB | ~4 to 8 GB | ~16 to 24 GB |
| 70B dense | ~2 to 4 GB | ~8 to 16 GB | does not fit on consumer hardware |
Practical examples:
- A 17 GB model on a 24 GB GPU leaves about 7 GB for KV cache. Fine for 8K to 16K context. Not enough for 128K.
- gpt-oss-20b (10.8 GB) on a 16 GB card leaves about 5 GB. Comfortable at 16K. Tight at 32K.
- Llama 3.3 70B Q4_K_M (39.6 GB) sharded across 2x RTX 4090 (48 GB total) leaves about 8 GB for KV. Stay under 32K context unless you switch to a Q4 KV cache type.
LLMKube exposes the KV cache dtype as a Model CRD field; see KV cache types for the trade-off (Q4 cache halves memory at a small quality cost).
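If you want to reproduce the table above for a specific model and context length, the standard per-token KV formula is enough. A minimal sketch using Llama 3.1 8B's published attention shape (32 layers, 8 KV heads via GQA, head dim 128) and an f16 cache; swap in your model's numbers and the cache dtype you actually configure.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes, per cached token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * ctx_len / 1e9

# Llama 3.1 8B (32 layers, 8 KV heads, head dim 128) with an f16 cache:
print(round(kv_cache_gb(32, 8, 128, 8_192), 2))    # ~1.1 GB at 8K
print(round(kv_cache_gb(32, 8, 128, 131_072), 1))  # ~17 GB at 128K -- where a quantized KV cache pays off
```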
Multi-GPU notes
llama.cpp supports multi-GPU inference via layer splitting. LLMKube wraps that with gpuCount on the Model CRD.
- Two 12 GB cards can run a 20B to 24B model (10 to 12 GB of weights per GPU plus KV).
- Two 24 GB cards comfortably hold a 70B at Q4_K_M (about 40 GB of weights split across two devices).
- PCIe communication adds latency. For a given total VRAM budget, one bigger card beats two smaller cards in tokens/sec. Pick multi-GPU when you need the model to fit, not for speed.
- Sharding shape matters. Layer-split (the default in llama.cpp and LLMKube) is simpler than tensor-parallel and works well for inference. Tensor-parallel needs NVLink-class interconnect to actually help.
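For a quick pre-check of a multi-GPU layout, split the memory rule from earlier across devices: under layer-split each GPU holds a contiguous slice of layers plus that slice's share of the KV cache. A minimal sketch; the even split and the ~1 GB per-GPU buffer allowance are assumptions, not measured values.

```python
def fits_layer_split(gguf_file_gb: float, kv_cache_gb: float,
                     vram_per_gpu_gb: float, n_gpus: int,
                     per_gpu_buffer_gb: float = 1.0) -> bool:
    """Rough fit check for layer-split inference across n_gpus identical cards.

    Assumes weights and KV cache divide roughly evenly; per_gpu_buffer_gb is a
    guessed allowance for runtime buffers, not a measured number.
    """
    per_gpu = (gguf_file_gb + kv_cache_gb) / n_gpus + per_gpu_buffer_gb
    return per_gpu <= vram_per_gpu_gb

# Llama 3.3 70B Q4_K_M (39.61 GB) plus ~5 GB of KV (roughly 16K context at f16):
print(fits_layer_split(39.61, 5.0, 24.0, 2))  # True  -> shards across 2x RTX 4090 with gpuCount: 2
print(fits_layer_split(39.61, 5.0, 24.0, 1))  # False -> a single 24 GB card cannot hold it
```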
The matrix, by hardware tier
Tier 1: Edge devices (up to 8 GB)
| Model | Best for | Recommended quant | File size | License | Context | Notes |
|---|---|---|---|---|---|---|
| Llama 3.2 1B Instruct | Classification, summarization, simple Q&A, autocomplete | Q8_0 | 1.23 GB | Llama 3.2 | 128K | Multilingual (8 languages), good for on-device. |
| Llama 3.2 3B Instruct | Lightweight chat, RAG retrieval, function-call routing | Q4_K_M | 1.88 GB | Llama 3.2 | 128K | The smallest model that still feels like an assistant. Q6_K (2.46 GB) if you have room. |
| Gemma 3 4B IT | Vision + text on edge, multilingual chat | Q4_K_M | 2.32 GB | Gemma | 128K | Multimodal (vision); 140+ languages. Add ~0.85 GB for the mmproj file. |
| Qwen3.5 4B | Long-context summarization, vision, edge agents | UD-Q4_K_XL | 2.91 GB | Apache 2.0 | 256K native, 1M with YaRN | Vision-language, thinking mode, hybrid MoE + DeltaNet. Add mmproj ~0.67 GB for vision. |
| Phi-4-mini Instruct | On-device math, function calling, structured output | Q4_K_M | ~2.5 GB | MIT | 128K | Strongest small model for tool use. |
| Granite 4.1 3B | Enterprise RAG / tool use on tight memory | Q4_K_M | ~2 GB | Apache 2.0 | 128K | Mamba-Transformer hybrid; cheap KV cache at long context. |
Tier 2: Entry GPUs / laptops (12 to 16 GB)
| Model | Best for | Recommended quant | File size | License | Context | Notes |
|---|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | General-purpose chat baseline, RAG, instruction following | Q4_K_M | 4.58 GB | Llama 3.1 | 128K | The default workhorse if in doubt. Q5_K_M (5.34 GB) for higher quality. |
| Granite 4.1 8B | Enterprise tool use, RAG, function calling, fill-in-the-middle code | Q4_K_M | 4.98 GB | Apache 2.0 | 128K | BFCL v3 score of 68.27, HumanEval 85.4. Strong for agentic backends. |
| Qwen3.5 9B | Vision + reasoning + long context in one package | Q4_K_M | 5.29 GB | Apache 2.0 | 256K native, 1M with YaRN | Thinking mode by default; vision via mmproj; 201 languages. |
| Gemma 3 12B IT | Vision-language tasks, multilingual chat, document QA | Q4_K_M | 7.30 GB | Gemma | 128K | 140+ languages. Add ~0.85 GB for mmproj. |
| DeepSeek R1 Distill Qwen 14B | Step-by-step reasoning, math, multi-step problems | Q4_K_M | 8.99 GB | MIT | 128K | Produces visible thinking traces. Q5_K_M (10.51 GB) fits 16 GB. |
| Phi-4 14B | Math, code, STEM reasoning on a tight memory budget | Q4_K_M | 8.43 GB | MIT | 16K | MATH 80.4, HumanEval 82.6, GPQA 56.1. Shorter context than peers. |
| gpt-oss-20b | Agentic reasoning, function calling, Python execution | Q4_K_M | 10.83 GB | Apache 2.0 | 128K | MoE: 21B total / 3.6B active. SWE-bench 60.7 at high reasoning. Note: Q8_0 is only 11.27 GB (model is natively MXFP4), so Q8 also fits 16 GB cards. |
Tier 3: Enthusiast (24 to 32 GB)
| Model | Best for | Recommended quant | File size | License | Context | Notes |
|---|---|---|---|---|---|---|
| Mistral Small 3.2 24B | Vision + tools + chat, function calling, instruction following | Q4_K_M | 14.33 GB | Apache 2.0 | 128K | HumanEval+ 92.9, MBPP+ 78.3. Vision via mmproj. |
| Devstral Small 2 24B | Software-engineering agents, multi-file edits, code exploration | Q4_K_M | 14.33 GB | Apache 2.0 | 256K | SWE-Bench Verified 68.0. Designed for Cline / agent IDEs. |
| Qwen3.5 27B | Dense multimodal model, vision + reasoning | Q4_K_M | 15.59 GB | Apache 2.0 | 256K | Vision-language; thinking mode; 201 languages. |
| Qwen3 Coder 30B A3B | Agentic coding, repo-scale understanding, browser-use | Q4_K_M | 17.29 GB | Apache 2.0 | 256K native, 1M with YaRN | MoE: 30.5B / 3.3B active. Non-thinking, function-call native. Compatible with Qwen Code, CLINE. |
| DeepSeek R1 Distill Qwen 32B | Heavy reasoning, math, long chain-of-thought | Q4_K_M | 18.48 GB | MIT | 128K | Visible reasoning traces. Q5_K_M (21.65 GB) fits 32 GB cards. |
| Qwen3.6 35B A3B | Best balance of speed + quality + vision in this tier | UD-Q4_K_XL | 20.83 GB | Apache 2.0 | 256K | MoE: 35B / 3B active. SWE-bench Verified 73.4, AIME 2026 92.7, MMLU-Pro 85.2. Strong vision. |
| Llama 3.3 70B Instruct | Production chat / multilingual at 70B quality | UD-IQ2_M | 22.62 GB | Llama 3.3 | 128K | Aggressive 2-bit dynamic quant. Quality dips vs Q4 but lands on a 24/32 GB device. Prefer Tier 4 if you have it. |
Tier 4: Pro single-node (48 to 64 GB)
| Model | Best for | Recommended quant | File size | License | Context | Notes |
|---|---|---|---|---|---|---|
| Llama 3.3 70B Instruct | Frontier-class general chat, multilingual, tool use | Q4_K_M | 39.61 GB | Llama 3.3 | 128K | The 70B everyone benchmarks against. Shards 2x 24 GB GPUs. Q5_K_M (46.53 GB) on 48 GB cards / 64 GB Macs. |
| DeepSeek R1 Distill Llama 70B | Heavy reasoning at 70B scale | Q4_K_M | ~42 GB | Llama 3.3 | 128K | Same shape as Llama 3.3 70B with R1-style thinking traces. |
| Qwen3.6 35B A3B | High-quality multimodal MoE with headroom for context | Q6_K | 27.30 GB | Apache 2.0 | 256K | Comfortable here; can run 128K+ context without IQ tricks. |
| Mistral Large 2411 (123B) | Multilingual frontier-class on Apple Ultra or multi-GPU | UD-IQ2_M | ~40 GB | Mistral Research / Commercial | 128K | Tight at 48 GB; 64 GB Mac is the natural home with extreme quants. Note this license is not Apache 2.0. |
Tier 5: Workstation (96 to 128 GB)
| Model | Best for | Recommended quant | File size | License | Context | Notes |
|---|---|---|---|---|---|---|
| gpt-oss-120b | Frontier-class reasoning and agentic work on a single 80 GB-class host | MXFP4_MOE | 59.02 GB (split across 2 files) | Apache 2.0 | 128K | MoE: 117B / 5.1B active. Built to run on a single 80 GB H100. Fits one 96 GB Mac or shards across 2x 48 GB GPUs. SWE-bench 62.4, GPQA Diamond 80.9 at high reasoning. |
| Llama 3.3 70B Instruct | Highest-quality 70B you can run locally | Q6_K | ~58 GB (community) | Llama 3.3 | 128K | Q8_0 (~70 GB) comfortably fits with KV cache headroom. |
| Devstral 2 123B (larger sibling) | Multi-file agentic SWE at scale | UD-IQ4_XS | ~68 GB | Apache 2.0 | 256K | For serious code agents. Verify the specific sibling repo for your machine. |
Tier 6: Multi-GPU / Ultra (192 GB and up)
| Model | Best for | Recommended quant | File size | License | Context | Notes |
|---|---|---|---|---|---|---|
| gpt-oss-120b | Max-quality reasoning with long context | Q8_0 | 60.79 GB (sharded) | Apache 2.0 | 128K | Near-fp16 quality in this tier; leaves room for a 128K KV cache. |
| DeepSeek V4 Flash (very large MoE) | Frontier coding / reasoning, long context | community IQ2/IQ4 | 200 GB+ | MIT | 128K+ | Community extreme quants target Apple Silicon (Metal). Verify the specific quant against your machine. |
| Multiple Llama 3.3 70B replicas | High-throughput serving | Q4_K_M | 39.61 GB each | Llama 3.3 | 128K | Use LLMKube replicas: N to scale horizontally across GPUs / nodes. |
The matrix, by task
Same models, sorted by what you want to do. Numbers cited come from the upstream model cards.
General assistant / chat / RAG
| Tier | First pick | Backup |
|---|---|---|
| 1 | Llama 3.2 3B (1.88 GB) | Gemma 3 4B (2.32 GB) |
| 2 | Llama 3.1 8B (4.58 GB) | Qwen3.5 9B (5.29 GB) |
| 3 | Mistral Small 3.2 24B (14.33 GB) | Qwen3.5 27B (15.59 GB) |
| 4 | Llama 3.3 70B Q4_K_M (39.61 GB) | Qwen3.6 35B A3B Q6_K (27.30 GB) |
| 5+ | gpt-oss-120b MXFP4 (59 GB) | Llama 3.3 70B Q8_0 |
Coding / agentic SWE
| Tier | First pick | Notes |
|---|---|---|
| 1 | Qwen3.5 4B (2.74 GB) | Smallest model with usable code skills. |
| 2 | Granite 4.1 8B Q4_K_M (4.98 GB) | HumanEval 85.4, MBPP 87.3, native FIM. |
| 2 | gpt-oss-20b Q4_K_M (10.83 GB) | SWE-bench 60.7 at high reasoning. |
| 3 | Qwen3 Coder 30B A3B (17.29 GB) | Best open agent-IDE model in this size class; 256K context. |
| 3 | Devstral Small 2 24B (14.33 GB) | SWE-Bench Verified 68.0; built for multi-file edits. |
| 4 | Qwen3 Coder 30B A3B Q6_K (~23 GB) | Same model, more quality headroom. |
| 5+ | gpt-oss-120b | SWE-bench 62.4, broader knowledge. |
Reasoning / math / STEM
| Tier | First pick | Notes |
|---|---|---|
| 1 | Phi-4-mini reasoning (~2.5 GB) | Tiny MIT-licensed reasoner. |
| 2 | Phi-4 14B Q4_K_M (8.43 GB) | MATH 80.4, GPQA 56.1, MIT. |
| 2 | DeepSeek R1 Distill Qwen 14B (8.99 GB) | Visible chain-of-thought. |
| 3 | DeepSeek R1 Distill Qwen 32B (18.48 GB) | Deeper reasoning, longer traces. |
| 3 | Qwen3.6 35B A3B (20.83 GB) | AIME 2026 92.7, MMLU-Pro 85.2 with thinking on. |
| 5+ | gpt-oss-120b (high reasoning) | GPQA Diamond up to 80.9. |
Vision / multimodal
llama.cpp loads the vision projector from a separate mmproj-*.gguf file alongside the language GGUF. Account for both.
| Tier | First pick | mmproj size |
|---|---|---|
| 1 | Gemma 3 4B (2.32 GB) | ~0.85 GB |
| 1 | Qwen3.5 4B (2.74 GB) | ~0.67 GB |
| 2 | Gemma 3 12B (7.30 GB) | ~0.85 GB |
| 2 | Qwen3.5 9B (5.29 GB) | ~0.85 GB |
| 3 | Mistral Small 3.2 24B (14.33 GB) | ~0.88 GB |
| 3 | Qwen3.5 27B (15.59 GB) | ~0.86 GB |
| 3 | Qwen3.6 35B A3B (20.83 GB) | ~0.84 GB |
Function calling / tool use / agents
| Tier | First pick | Notes |
|---|---|---|
| 1 | Phi-4-mini Instruct | Strong tool use at edge sizes. |
| 2 | Granite 4.1 8B | BFCL v3 score of 68.27, dedicated tool-call template. |
| 2 | gpt-oss-20b | Native function calling, web/python tools. |
| 3 | Qwen3 Coder 30B A3B | Native function-call format, designed for agent loops. |
| 3 | Mistral Small 3.2 24B | Mistral function-call template, vLLM tool parser. |
| 5+ | gpt-oss-120b | Production-grade tool use at frontier quality. |
Long context (100K+ tokens)
| Tier | First pick | Native context |
|---|---|---|
| 1 | Qwen3.5 4B | 256K (1M with YaRN) |
| 2 | Qwen3.5 9B | 256K (1M with YaRN) |
| 2 | Granite 4.1 8B | 128K, cheap KV cache (Mamba hybrid) |
| 3 | Qwen3 Coder 30B A3B | 256K native, 1M with YaRN |
| 3 | Qwen3.6 35B A3B | 256K native, 1M with YaRN |
| 4+ | Devstral Small 2 24B | 256K |
Phi-4 14B is only 16K context; do not pick it for long-document workloads.
What this doc deliberately does not include
- Uncensored / abliterated / “heretic” variants. They show up high in HuggingFace trending but I want this list to be reproducible.
- Hobbyist frankenmerges (anything labeled “X-Distilled-GGUF” from non-vendor authors). The base models above are the inputs to most of those; pick the base unless you have a specific reason.
- Pure base models (no `-Instruct` / `-it` suffix). LLMKube users almost always want instruct variants.
- Embedding / reranker GGUFs. Those belong in a separate matrix for RAG infrastructure.
- GGUFs marked `imatrix`-only (calibration files, not runnable models).
- CPU-only inference benchmarks. Possible but typically 1 to 5 tok/s for a 7B; this doc assumes GPU or Apple Silicon.
Methodology and sources
Every file size in this doc was fetched from the HuggingFace tree API on 2026-05-11. Capability claims (context length, license, benchmarks) come from the upstream meta-llama/, Qwen/, mistralai/, google/, microsoft/, openai/, deepseek-ai/, and ibm-granite/ model cards, not from third-party quanters. The Unsloth and Bartowski GGUF repos are referenced because they are the most-downloaded community quants with reproducible naming, but the underlying weights and licenses follow the upstream publisher.
When in doubt, prefer Q4_K_M as the starting quant, measure quality on your own task, and step up to Q5_K_M or Q6_K only if you see issues. Going below Q4 (IQ3, Q2, IQ2, IQ1) trades real quality for memory; only do it if the model otherwise will not fit.
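To make "measure quality on your own task" concrete, deploy the two quants side by side and run the same prompts through both. A minimal sketch against OpenAI-compatible endpoints (the protocol llama.cpp's server speaks); the URLs, ports, and model name here are placeholders for however you expose the two deployments, not LLMKube defaults.

```python
from openai import OpenAI  # pip install openai

PROMPTS = [
    "Summarize this incident report in three bullet points: ...",
    "Write a SQL query returning the top 10 customers by revenue: ...",
]

# Placeholder endpoints: point these at wherever you expose each deployment.
ENDPOINTS = {
    "q4_k_m": "http://localhost:8081/v1",
    "q5_k_m": "http://localhost:8082/v1",
}

for quant, base_url in ENDPOINTS.items():
    client = OpenAI(base_url=base_url, api_key="not-needed-locally")
    for prompt in PROMPTS:
        resp = client.chat.completions.create(
            model="local",  # many local servers ignore this field; adjust if yours does not
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,  # near-deterministic output makes side-by-side comparison easier
        )
        print(f"[{quant}] {resp.choices[0].message.content[:200]}")
```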
A reminder that this will go stale
The local LLM ecosystem is one of the fastest-moving spaces in software. Between when this was written and when you are reading it, expect that:
- New model generations (Llama 4.x, Qwen3.7, Gemma 5, etc.) have probably shifted the recommendations.
- Quanters have re-uploaded files with better imatrix calibration; sizes shift by a few percent.
- Tools like llama.cpp and LLMKube have added new quant types, faster KV cache modes, or runtime backends.
- License terms occasionally change (especially Llama and Gemma).
Use this as a structured starting point. Always click through to the upstream HuggingFace model card before downloading a large file. If you find this doc is meaningfully wrong, a PR with the corrected row is appreciated, but please do not assume any single number here is current the day you read it.