Local LLM VRAM Guide: Fitting Models on Consumer GPUs
Written by PltOps @ #B4mad Industries, February 2026. Context: investigation of inference timeouts on our local Ollama setup (bead beads-hub-b4u).
TL;DR
An 80B dense model needs ~51GB VRAM. Our RTX 4090 has 24GB. The overflow spilled to CPU RAM, causing crippling timeouts. We switched to a Mixture-of-Experts (MoE) model (qwen3-coder:30b-a3b-q4_K_M) that fits in 18GB and activates only ~3B parameters per token. Problem solved.
Our GPU Setup
| Component | Spec |
|---|---|
| GPU | NVIDIA RTX 4090 |
| VRAM | 24 GB GDDR6X |
| Host RAM | 128 GB DDR5 |
| Inference server | Ollama |
| OS | Linux (WSL2) |
Why the 80B Model Failed
The Math
For a dense transformer model with Q4 quantization:
VRAM ≈ (params × bits_per_param) / 8 + KV_cache + overhead
80B × 4 bits / 8 = 40 GB (weights alone)
+ KV cache (~8 GB at 8K context) = ~48 GB
+ CUDA overhead (~3 GB) = ~51 GB total
Our 24GB card can hold ~22GB of model weights (after reserving for KV cache and overhead). That means ~55% of the model stays in VRAM, and the remaining ~45% spills to system RAM.
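The arithmetic above can be sketched as a small helper. The KV cache, overhead, and on-GPU reservation figures are this article's estimates (the ~2 GB reservation is an illustrative assumption), not exact values:

```python
def vram_needed_gb(params_b: float, bits: int = 4,
                   kv_cache_gb: float = 8.0, overhead_gb: float = 3.0) -> float:
    """Dense-model VRAM estimate: weights + KV cache + CUDA/runtime overhead."""
    return params_b * bits / 8 + kv_cache_gb + overhead_gb

weights_gb = 80 * 4 / 8          # 40.0 GB of 4-bit weights
total_gb = vram_needed_gb(80)    # 51.0 GB total requirement

# Fraction of the weights that fits on a 24 GB card, assuming (hypothetically)
# ~2 GB stays reserved on-GPU for KV cache and overhead:
fit_fraction = (24 - 2) / weights_gb   # 22/40 = 0.55 -> ~55% on GPU, ~45% spills
```

Plugging in other model sizes shows how quickly dense models outgrow a 24 GB card.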
What CPU Spillover Looks Like
When a model partially offloads to CPU:
- Each forward pass shuttles tensors across the PCIe bus (~16 GB/s vs ~1 TB/s GPU memory bandwidth)
- Inference slows by 10-50× for the offloaded layers
- Ollama’s default timeout fires and the request fails
- The GPU sits partially idle waiting for CPU layers to complete
This is exactly what we observed: intermittent timeouts, high CPU usage during inference, and GPU utilization never hitting 100%.
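A back-of-envelope view of why the spilled portion dominates latency, using the bandwidth figures quoted above (the raw bandwidth ratio is an upper bound on the per-layer slowdown, since CPU-offloaded layers also compute locally rather than streaming every byte over PCIe):

```python
# ~17 GB is roughly the ~45% of the 40 GB weight tensor that spilled to RAM.
spilled_gb = 17
pcie_gbps = 16      # PCIe bandwidth figure quoted above
vram_gbps = 1000    # ~1 TB/s GDDR6X on the RTX 4090

t_pcie = spilled_gb / pcie_gbps   # ~1.06 s to stream the spilled weights over PCIe
t_vram = spilled_gb / vram_gbps   # ~0.017 s for the same bytes in VRAM
slowdown = t_pcie / t_vram        # ~62x bandwidth gap for the offloaded portion
```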
The Investigation
Options Considered
| Option | VRAM Needed | Trade-off |
|---|---|---|
| 80B Q4_K_M (original) | ~51 GB | Way too large |
| 80B Q2_K | ~25 GB | Fits barely, severe quality loss |
| 70B Q3_K_S | ~30 GB | Still too large |
| 32B Q4_K_M (dense) | ~20 GB | Fits, but fewer params = less capability |
| 30B MoE Q4_K_M | ~18 GB | Fits with headroom, 30.5B total / ~3B active |
Why We Didn’t Just Use a Smaller Dense Model
A dense 32B model activates all 32B parameters for every token. A 30B MoE model has 30.5B total parameters but routes each token through only ~3B of them (the “active” experts). This means:
- Knowledge capacity comparable to a much larger model (experts specialize)
- Inference cost comparable to a 3B model (only active params compute)
- Best of both worlds for VRAM-constrained setups
Why MoE Wins for Constrained VRAM
How Mixture-of-Experts Works
Input Token
     │
┌────────┐
│ Router │  ← Learned gating network
└────────┘
     │ selects top-k experts (typically 2)
┌──────┬──────┬──────┬──────┐
│ Exp1 │ Exp2 │ Exp3 │ ...  │  ← Only selected experts compute
└──────┴──────┴──────┴──────┘
     │ weighted sum
   Output
- All expert weights live in VRAM (you pay full storage cost)
- Only 2 experts run per token (you pay minimal compute cost)
- Result: big model knowledge, small model speed
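A minimal sketch of the routing step in plain Python. Dimensions are toy-sized and the "experts" are random linear maps; a real MoE layer uses learned gate weights and runs its experts as batched matmuls:

```python
import math
import random

random.seed(0)
DIM, N_EXPERTS, TOP_K = 8, 4, 2

# Toy experts: each is just a random DIM x DIM linear map here.
experts = [[[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(N_EXPERTS)]
gate_w = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(N_EXPERTS)]

def matvec(m, v):
    return [sum(row[i] * v[i] for i in range(len(v))) for row in m]

def moe_forward(x):
    # Router: one logit per expert, pick the top-k.
    logits = [sum(w[i] * x[i] for i in range(DIM)) for w in gate_w]
    top = sorted(range(N_EXPERTS), key=lambda e: logits[e], reverse=True)[:TOP_K]
    # Softmax over the selected experts only.
    exps = [math.exp(logits[e]) for e in top]
    weights = [v / sum(exps) for v in exps]
    # Only TOP_K of N_EXPERTS experts ever compute -- the rest sit in memory.
    out = [0.0] * DIM
    for w, e in zip(weights, top):
        y = matvec(experts[e], x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top

y, chosen = moe_forward([1.0] * DIM)
```

The storage/compute asymmetry is visible directly: all `N_EXPERTS` weight matrices exist in memory, but each token touches only `TOP_K` of them.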
The Key Insight
VRAM stores the full model, but compute scales with active parameters. For a 30B MoE with ~3B active:
- Storage: 18 GB (all experts in VRAM)
- Compute per token: equivalent to a ~3B dense model
- Tokens/second: fast (limited by the ~3B active path, not 30B)
Our Final Configuration
Model: qwen3-coder:30b-a3b-q4_K_M
Total params: 30.5B
Active params per token: ~3B
Quantization: Q4_K_M (4-bit, k-quant mixed)
VRAM usage: ~18 GB
Remaining VRAM: ~6 GB (KV cache + overhead)
Context window: 8K default (expandable with VRAM headroom)
VRAM Budget Breakdown
Total VRAM: 24.0 GB
─────────────────────────────
Model weights: 15.5 GB (30.5B × 4 bits / 8, compressed)
KV cache (8K ctx): 1.5 GB
CUDA context: 0.8 GB
Ollama overhead: 0.2 GB
─────────────────────────────
Used: ~18.0 GB
Free: ~6.0 GB (buffer for longer contexts)
How to Check Your Own Setup
1. Check Available VRAM
nvidia-smi
# Look for the "MiB" columns: total and used
2. Check Model Size Before Pulling
# See model details on ollama.com or:
ollama show <model> --modelfile
# Look for the parameter count and quantization
3. Estimate VRAM Requirement
# Quick formula for Q4 quantization:
# VRAM (GB) โ params_in_billions ร 0.5 + 2 (overhead)
# Examples:
# 7B โ ~5.5 GB
# 13B โ ~8.5 GB
# 30B โ ~17 GB
# 70B โ ~37 GB
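The rule of thumb above as a one-liner, reproducing the listed examples:

```python
def quick_vram_gb(params_b: float) -> float:
    """Rule-of-thumb VRAM for Q4 quantization: 0.5 GB per billion params + 2 GB overhead."""
    return params_b * 0.5 + 2

quick_vram_gb(7)   # 5.5
quick_vram_gb(13)  # 8.5
quick_vram_gb(30)  # 17.0
quick_vram_gb(70)  # 37.0
```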
4. Monitor During Inference
# Watch GPU usage in real-time:
watch -n 1 nvidia-smi
# Check if Ollama is offloading to CPU:
# Look for "offloading X layers to CPU" in Ollama logs
journalctl -u ollama -f
# or, on macOS, tail the server log directly:
tail -f ~/.ollama/logs/server.log
5. Check Layer Offloading
If nvidia-smi shows VRAM maxed out and CPU usage is high during inference, layers are being offloaded. This is the #1 cause of slow local LLM performance.
Future Considerations
- RTX 5090 (32GB): Would allow larger MoE models or dense 32B at full context
- Multi-GPU: Ollama doesn’t natively split across GPUs well; vLLM or llama.cpp can
- Better quantizations: As Q4_K_M evolves (GGUF improvements), quality per bit improves
- Longer context: Our 6GB headroom allows ~16K context; for 32K+ we’d need a smaller model or more VRAM
- VRAM-efficient attention: Flash Attention and paged KV cache (vLLM) reduce KV cache footprint
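To make the KV-cache trade-off concrete, the standard sizing formula is 2 tensors (K and V) per layer, each `n_kv_heads × head_dim` per token. The architecture numbers below are a hypothetical 30B-class config with grouped-query attention, not the actual qwen3-coder specs:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_val: int = 2) -> float:
    """KV-cache size: 2 tensors (K and V) per layer, fp16 values by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_val / 1e9

# Hypothetical config: 48 layers, 8 KV heads of dim 128 (GQA).
kv_cache_gb(48, 8, 128, 8_192)    # ~1.6 GB at 8K context
kv_cache_gb(48, 8, 128, 32_768)   # ~6.4 GB at 32K -- consumes our whole headroom
```

This is why 32K+ contexts push past the ~6 GB buffer, and why paged or quantized KV caches matter for long-context work.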