When your buyer says they're "deploying a model," the physical act is this: copying a file onto a machine with enough GPU memory to hold it. The file contains billions of numbers (called weights) that encode everything the model learned during training. The file has a size, measured in gigabytes. The GPU has a memory ceiling, called VRAM. If the file fits, the model runs. If it doesn't, it doesn't. This piece profiles four current models as physical specimens: Google's Gemma 4 26B, Meta's Llama 4 Scout (109B) and Llama 4 Maverick (400B), and DeepSeek R1 (671B). Know what each one weighs, what hardware holds it, and why the number on the label doesn't always predict the number on the invoice, and you can hold your own when a buyer starts talking about on-prem inference.
One Formula, Every Deployment Decision
One multiplication runs the whole deployment conversation. Each parameter in a model is stored at a precision level that determines how many bytes it occupies:
- Full precision (BF16): 2 bytes per parameter
- 8-bit (INT8): 1 byte per parameter
- 4-bit (Q4): 0.5 bytes per parameter
Parameters × bytes-per-parameter = weight file size. A 26-billion-parameter model at full precision: 26B × 2 = 52GB. At 4-bit: 26B × 0.5 = 13GB. That's the raw weight storage. Actual VRAM usage runs 10–30% higher, and more with long contexts, because the model needs working space at runtime: context buffers, activation memory, framework overhead. So a file that's 13GB on disk needs roughly 17–18GB of VRAM once loaded. The file size gets you in the right neighborhood. The VRAM figure tells you whether it actually fits.
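The multiplication above is small enough to sketch in a few lines of Python. This is a back-of-envelope helper, not a sizing tool: the function names are made up for illustration, and the 10–30% overhead range is simply the one quoted above.

```python
# Bytes per parameter at each precision, taken from the list above.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "q4": 0.5}


def weight_size_gb(params_billions: float, precision: str) -> float:
    """Raw weight size in GB: parameters x bytes-per-parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


def vram_range_gb(params_billions: float, precision: str,
                  low: float = 0.10, high: float = 0.30) -> tuple[float, float]:
    """Loaded footprint after applying the 10-30% runtime overhead range."""
    size = weight_size_gb(params_billions, precision)
    return size * (1 + low), size * (1 + high)


if __name__ == "__main__":
    for precision in ("bf16", "int8", "q4"):
        size = weight_size_gb(26, precision)
        low_gb, high_gb = vram_range_gb(26, precision)
        print(f"26B @ {precision}: {size:.0f} GB weights, "
              f"~{low_gb:.1f}-{high_gb:.1f} GB VRAM")
```

For the 26B model at Q4, the upper bound comes out at about 16.9GB, which is where the 17–18GB planning figure starts. Long context windows push the real number higher.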
Reducing precision is called quantization. Lossy compression for model weights: smaller file, slightly fuzzier output. How much fuzzier depends on the model and the task. More on that below.
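To make "smaller file, slightly fuzzier output" concrete, here is a toy sketch of 4-bit quantization on six made-up weights. It uses one shared scale for the whole list, which is a deliberate simplification: production formats such as Q4_K_M work block by block and are more sophisticated. The function names and sample values are illustrative assumptions.

```python
def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric 4-bit quantization: map floats to integers in [-7, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


# Made-up weights, chosen only to illustrate the rounding.
weights = [0.70, -0.33, 0.12, 0.04, -0.61, 0.28]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:   ", codes)
print("restored:", [round(r, 2) for r in restored])
print("worst-case error:", round(max_error, 3))
```

Each weight now needs only a small integer instead of a 16-bit float, and the price is a rounding error of at most half the scale step. That rounding error is the "fuzz."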
Gemma 4 26B A4B
- What it is: Google DeepMind's mid-size open-weight model, released April 2026 under Apache 2.0, a permissive license on par with the MIT license DeepSeek R1 uses and far looser than Meta's Llama license.
- What it does: General-purpose text and multimodal inference with a 256K token context window. Fast enough for interactive use on consumer hardware.
- Who's behind it: Google DeepMind. Part of the Gemma family, Google's open-weight counterpart to their proprietary Gemini line.
- What makes it distinct: It's a Mixture of Experts (MoE) architecture. 26 billion total parameters, but only about 4 billion activate per token. You pay the VRAM cost of 26B but get the speed of 4B. At Q4, the model needs roughly 17–18GB of VRAM, which means an RTX 4090 (24GB VRAM) handles it with headroom. At full precision, about 52GB, fitting on a single H100. This is the model that actually runs on hardware a person might own.
Llama 4 Scout (109B)
- What it is: Meta's MoE model with 109 billion total parameters, 17 billion active per token, using 16 expert sub-networks. Released under Meta's Llama Community License, a custom license with usage thresholds. Not OSI-approved open source, despite the open weights.
- What it does: Multimodal inference with a 10-million-token context window. Using that full context requires enormous VRAM for the key-value cache alone, so the headline number is more theoretical than operational.
- Who's behind it: Meta. Released with open weights on HuggingFace.
- What makes it distinct: Scout is where the "just count the active parameters" intuition starts to mislead. The 17B active parameters sound laptop-friendly. They are not. All 109B must sit in VRAM because the router can send any token to any expert, and you can't predict which experts get called next. At INT4, that's roughly 55GB. Meta's documentation says it fits on a single H100 at INT4. No consumer GPU can run it at any reasonable quality.
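The key-value cache mentioned in Scout's profile grows linearly with context length, which is why the 10-million-token headline is hard to use in practice. The sketch below applies the standard sizing rule for a plain attention cache (keys plus values, for every layer and every token). The layer count, KV head count, and head dimension are hypothetical values chosen for illustration, not Scout's published configuration, and architectures that compress or window the cache will need less.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: float = 2.0) -> float:
    """KV cache size in GB: keys and values for every layer and token."""
    # Factor of 2 covers one key vector plus one value vector per head.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1e9


# Hypothetical architecture values, for illustration only.
ILLUSTRATIVE = dict(layers=48, kv_heads=8, head_dim=128)

for tokens in (8_000, 128_000, 10_000_000):
    size = kv_cache_gb(tokens, **ILLUSTRATIVE)
    print(f"{tokens:>10,} tokens -> {size:,.1f} GB of cache")
```

Under these assumptions, 10 million tokens of context comes to roughly 2TB of cache at 16-bit, on top of the weights themselves. The exact figure depends on the real architecture, but the order of magnitude is the point.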
Llama 4 Maverick (400B)
- What it is: Meta's larger MoE model. 400 billion total parameters, still 17 billion active per token, but with 128 experts instead of Scout's 16. Same Llama Community License as Scout.
- What it does: Same general capabilities as Scout with a deeper expert pool. 1-million-token context window.
- Who's behind it: Meta. Released in both BF16 and FP8 quantized weights.
- What makes it distinct: Maverick and Scout activate the same 17B parameters per token, so they generate text at similar speeds. The difference is that Maverick carries 128 experts instead of 16, which means the VRAM bill is roughly 4× Scout's. At Q4, you need approximately 200GB, which means 3× H100 80GB GPUs minimum. At full precision, 800GB. Datacenter hardware, full stop.
DeepSeek R1 (671B)
- What it is: DeepSeek's reasoning-focused MoE model. 671 billion total parameters, roughly 37 billion active per token. MIT license.
- What it does: Chain-of-thought reasoning. It works through problems step-by-step before answering, which produces stronger results on math, code, and structured analysis at the cost of longer outputs and higher token consumption.
- Who's behind it: DeepSeek, a Chinese AI lab. The model made headlines in early 2025 when the weights were released publicly.
- What makes it distinct: At full precision, the weight file is 1.34 terabytes. At Q4, roughly 376GB of VRAM (about 336GB of raw weights plus runtime overhead), which means 5–6 H100s. Nobody runs the full R1 on anything resembling normal hardware. What people actually deploy locally are DeepSeek's distilled models: smaller dense models, built on existing Qwen and Llama base models and fine-tuned on reasoning data generated by the full R1. Distillation and quantization do different things. Quantization compresses the same model into a smaller file. Distillation creates a different model entirely, with its own weights, its own architecture, its own error profile. The 32B distill at Q4 fits in about 20GB of VRAM, putting it on an RTX 4090. When someone says they're "running DeepSeek locally," this is almost certainly what they mean.
What Each Model Actually Requires
Parameter count fills the spec sheet. Hardware tier fills the purchase order. Clustering these models by what they actually require to run is the most useful frame for a buyer conversation, because the buyer's next question after "which model?" is always "on what?"
| Model | Total Params | Active Params | VRAM at Q4 | VRAM at BF16 | License |
|---|---|---|---|---|---|
| Gemma 4 26B | 26B | ~4B | ~17–18GB | ~52GB | Apache 2.0 |
| Llama 4 Scout | 109B | ~17B | ~55GB | ~218GB | Llama Community |
| Llama 4 Maverick | 400B | ~17B | ~200GB | ~800GB | Llama Community |
| DeepSeek R1 | 671B | ~37B | ~376GB | ~1,340GB | MIT |
| DeepSeek R1 32B distill | 32B | 32B (dense) | ~20GB | ~64GB | MIT |
Tier 1: Single Consumer GPU (24–32GB VRAM, e.g., RTX 4090 or RTX 5090)
Gemma 4 26B at Q4 (~17–18GB VRAM) and the DeepSeek R1 32B distill at Q4 (~20GB). That's it. Everything else in this article exceeds what consumer hardware can hold.
Tier 2: Single Datacenter GPU (80GB VRAM, e.g., H100 or the 80GB A100)
Gemma 4 26B at full precision (~52GB of weights) and Llama 4 Scout at INT4 (~55GB of weights). Both load on an 80GB card, but after runtime overhead neither leaves much room for long contexts. One card, one model, one procurement conversation.
Tier 3: Multi-GPU Server (160–480GB VRAM, 2–6× H100)
Scout at full precision needs roughly 4× H100. Maverick at Q4 needs 3× H100. The full DeepSeek R1 at Q4 needs 5–6× H100. This tier is where you start talking to cloud providers about reserved instances.
Tier 4: Large Cluster (700GB+ VRAM, 9+ H100 or 5+ H200)
Maverick at full precision (~800GB). DeepSeek R1 above Q4. For a sense of cost: an H100 runs about $2.50/hour on-demand as of early 2026, so even the 6-GPU R1 deployment at Q4 costs roughly $15/hour for compute alone, before serving a single token.
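The tiering above reduces to two small calculations: divide the VRAM requirement by the per-card memory and round up, then multiply by the hourly rate. A minimal sketch, assuming 80GB H100s and the $2.50/hour on-demand figure quoted above. The function names are illustrative, and the GPU count is a floor: it ignores the memory lost to splitting a model across cards, which is why real deployments often add one.

```python
import math

H100_VRAM_GB = 80          # per-card memory used throughout the tiers above
H100_USD_PER_HOUR = 2.50   # on-demand figure quoted above (early 2026)


def gpus_needed(vram_gb: float, per_gpu_gb: float = H100_VRAM_GB) -> int:
    """Minimum number of cards whose combined memory covers the requirement."""
    return math.ceil(vram_gb / per_gpu_gb)


def hourly_cost(gpu_count: int,
                usd_per_gpu_hour: float = H100_USD_PER_HOUR) -> float:
    """Compute-only rental cost per hour."""
    return gpu_count * usd_per_gpu_hour


# VRAM-at-Q4 figures from the comparison table above.
for name, vram_gb in [("Scout", 55), ("Maverick", 200), ("DeepSeek R1", 376)]:
    count = gpus_needed(vram_gb)
    print(f"{name}: at least {count} x H100, ${hourly_cost(count):.2f}/hour")
```

For R1 at Q4, the floor is five cards. Provisioning a sixth for headroom gives the $15/hour figure.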
The MoE Complication
The hardware tiers above reflect a pattern worth naming explicitly, because it trips up everyone the first time.
In a dense model, every parameter participates in every computation. Bigger model, more compute, more memory, slower output. Linear and predictable. And irrelevant to most of the models in this piece, because they use MoE.
MoE models contain multiple "expert" sub-networks. A router picks which experts handle each token. The rest sit idle for that token. Two consequences follow.
First, all experts must be loaded into VRAM because the router can potentially call any of them on any token. You can't swap experts in and out on demand without catastrophic latency. Total parameters set your VRAM floor.
Second, active parameters determine speed. Maverick and Scout both activate 17B parameters per token. They generate text at roughly the same rate. But Maverick needs roughly 4× the VRAM (400B total parameters against 109B) because it carries 8× the experts.
So: total parameters tell you what hardware to buy. Active parameters tell you how fast it runs. When a buyer asks "how big is this model," you want to know which kind of big they mean: the kind that determines speed, or the kind that determines cost.
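The two kinds of "big" can be put side by side in a short sketch. The class and field names are illustrative. The parameter counts are the ones from the profiles above, and the 0.5 bytes per parameter is the Q4 figure from the formula section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    name: str
    total_params_b: float   # sets the VRAM floor (what hardware to buy)
    active_params_b: float  # sets per-token compute (how fast it runs)

    def weights_gb(self, bytes_per_param: float) -> float:
        """Every expert must be resident, so memory scales with the total."""
        return self.total_params_b * bytes_per_param

    def active_fraction(self) -> float:
        """Share of the weights that actually does work on a given token."""
        return self.active_params_b / self.total_params_b


scout = MoEModel("Llama 4 Scout", 109, 17)
maverick = MoEModel("Llama 4 Maverick", 400, 17)

memory_ratio = maverick.weights_gb(0.5) / scout.weights_gb(0.5)
compute_ratio = maverick.active_params_b / scout.active_params_b

print(f"Memory: Maverick needs {memory_ratio:.1f}x Scout's weights")
print(f"Compute per token: {compute_ratio:.1f}x")
print(f"Maverick uses {maverick.active_fraction():.1%} of its weights per token")
```

The memory ratio lands near 3.7 while the compute ratio is exactly 1: same speed class, very different purchase order.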
Model files work like IDAM components: the DeepSeek R1 weight file is byte-for-byte identical whether it runs on DeepSeek's servers or on an air-gapped GovCloud instance. The data path changes, the jurisdiction changes, the logging changes. The model doesn't. When a buyer asks about "self-hosting" a model, they're asking the same question your IDAM buyers ask about on-prem federation: where does my data go?
What Quantization Actually Costs You
Quantization is cheap but not free. The practical impact varies enough by model that generalizations are worse than useless.
For Gemma 4 26B at Q4, benchmarks across 250K tokens show that long documents and non-Latin scripts degrade fastest. For most English-language tasks, Q4 output is close enough to full precision that you'd struggle to tell the difference blind. The MoE architecture actually helps here: because only a fraction of parameters activate per token, the model tolerates quantization better than a dense model of equivalent total size.
For DeepSeek R1's distilled models, Q4_K_M introduces under 1% accuracy loss on STEM benchmarks versus full precision, with about 3.5% on broader evaluations. Reasoning models are more sensitive to quantization than chat models because chain-of-thought depends on precise intermediate calculations. The full 671B model shows 3–6 point dips on knowledge tasks at Q4, with more visible degradation on math and structured reasoning.
For Llama 4 Scout and Maverick, Meta's own FP8 release of Maverick suggests they consider 8-bit the quality floor for serious use. Community testing confirms INT4 is workable for document processing and RAG but noticeably weaker on reasoning tasks.
For buyer conversations: Q4 is the "good enough for most production work" line. Q8 is the "I care about reasoning accuracy" line. Full precision is the "money is no object" line.
IDAM practitioners already understand capability tradeoffs by environment. A lightweight MFA agent on a mobile device doesn't offer the same policy granularity as a full PAM gateway. Quantization applies the same logic: a Q4 model on a 24GB consumer GPU is the field-deployable lightweight agent, while the full-precision model on an H100 cluster is the enterprise gateway. The buyer question is "what does this environment need to do, and what's the minimum precision that supports it?"
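That buyer question can be turned around: given the hardware in the environment, what is the highest precision that fits? A minimal sketch, assuming a flat 20% runtime overhead as a planning number and the bytes-per-parameter values from the formula section. The function name is made up for illustration, and the result says nothing about whether that precision is accurate enough for the task.

```python
from typing import Optional

# Highest precision first, so the first match is the best one that fits.
PRECISIONS = [("bf16", 2.0), ("int8", 1.0), ("q4", 0.5)]


def highest_precision_that_fits(params_billions: float, vram_gb: float,
                                overhead: float = 0.20) -> Optional[str]:
    """Return the highest precision whose loaded footprint fits in VRAM."""
    for name, bytes_per_param in PRECISIONS:
        needed_gb = params_billions * bytes_per_param * (1 + overhead)
        if needed_gb <= vram_gb:
            return name
    return None  # does not fit at any listed precision


print(highest_precision_that_fits(26, 24))    # Gemma 4 26B on a 24GB card
print(highest_precision_that_fits(26, 80))    # Gemma 4 26B on an 80GB card
print(highest_precision_that_fits(109, 24))   # Scout on a 24GB card
```

A `None` result is the signal to move up a hardware tier rather than down another precision level.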
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "Models are really complex software systems" | "A model is a file. Gemma 4 at Q4 needs about 18GB of GPU memory. You load it and it runs." | Makes deployment concrete. |
| "You need a lot of compute to run AI" | "Gemma 4 fits on a $1,600 graphics card. DeepSeek R1 needs a cluster that costs $15/hour to rent." | Anchors cost to specific hardware. |
| "Bigger models are better" | "Maverick and Scout process tokens at the same speed because they activate the same 17B parameters. Maverick just needs 4× the memory." | Shows you understand MoE, which most buyers don't. |
| "Quantization reduces quality" | "At 4-bit, quality loss typically runs 1–3% on standard benchmarks. The tradeoff is real but small for most production tasks." | Prevents the buyer from assuming quantized means broken. |
| "Open-source models are free" | "The weights are free to download. The license varies — Gemma is Apache 2.0, Llama has a custom license with usage limits. And the GPU to run them is not free." | Separates software cost from infrastructure cost and flags license differences. |
| "You can run any model locally" | "Anything under 30B parameters at Q4 fits on a single high-end GPU. Above that, you're into datacenter hardware or cloud instances." | Gives the buyer a concrete planning threshold. |
| "DeepSeek R1 is a 671-billion-parameter model" | "The full R1 is 671B and needs a GPU cluster. What most people actually run is the 32B distilled version — a different, smaller model — which fits on one GPU." | Distinguishes the headline from the deployment reality. |
| "Self-hosting gives you more control" | "Self-hosting means the weight file runs on your hardware, so inference data never leaves your network. The model itself is identical either way." | Frames self-hosting as a data-path decision. |
| "The model needs a GPU" | "The model needs enough VRAM to hold its weight file plus about 20% overhead. An 18GB model needs a GPU with at least 24GB of VRAM." | Shows you understand the actual constraint. |
| "MoE models are more efficient" | "MoE models are faster per token, but they still load all parameters into memory. Scout needs 109B in VRAM even though only 17B activate per token." | Prevents the most common MoE misconception. |
VRAM is a hard limit. If the model file exceeds available VRAM, the system spills weights to system RAM at roughly 1/12th the bandwidth. This resembles how an IDAM deployment that exceeds its connection pool doesn't gracefully degrade — it falls off a cliff. When a buyer says they want to "start small and scale up," the honest response is that model deployment jumps between discrete hardware tiers. There's no smooth gradient between a $1,600 GPU and a $15/hour cluster.
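A rough model shows why spilling feels like a cliff rather than a slope. It assumes token generation is limited by how fast weights can be read from memory, that reads are spread evenly across the weights, and that system RAM delivers 1/12th of VRAM bandwidth as stated above. The function name is illustrative, and real runtimes will behave less tidily.

```python
def spill_slowdown(spilled_fraction: float,
                   bandwidth_ratio: float = 12.0) -> float:
    """Relative time per token when part of the weights sit in system RAM.

    1.0 means full speed; 12.0 means every weight read comes from slow memory.
    """
    if not 0.0 <= spilled_fraction <= 1.0:
        raise ValueError("spilled_fraction must be between 0 and 1")
    fast_part = 1.0 - spilled_fraction
    slow_part = spilled_fraction * bandwidth_ratio
    return fast_part + slow_part


for fraction in (0.0, 0.1, 0.25, 0.5):
    print(f"{fraction:.0%} spilled -> {spill_slowdown(fraction):.2f}x slower")
```

Under these assumptions, spilling just 10% of the weights roughly doubles the time per token, and spilling half makes generation about 6.5 times slower.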
Things to follow up on...
- Red Hat's quantization study: Red Hat ran over half a million evaluations on quantized LLMs and confirmed that 4-bit and 8-bit models hold up across production workloads, which is the most rigorous public validation of the "Q4 is good enough" claim we reference here.
- Unsloth's GGUF quality benchmarks: Independent KL divergence testing across 80 quantized versions of Gemma 4 26B found that long documents and non-Latin scripts degrade fastest at every quantization level, a finding that matters for any multilingual or document-heavy deployment.
- DeepSeek on AWS Bedrock: AWS was the first cloud provider to offer DeepSeek R1 as a fully managed serverless model in Bedrock, which makes the "same file, different data path" distinction from this piece immediately concrete for buyers evaluating Chinese-origin models.
- The OSI open source definition: The Open Source Initiative's OSAID v1.0 requires the weights, the code, and sufficiently detailed information about the training data (though not necessarily the data itself) for a model to qualify as open source AI, which means almost none of the models in this piece actually pass the bar despite being widely called "open source."

