Under the Hood — 4.2
Every AI model you'll encounter in a procurement conversation has a number attached to it. Seven billion. Seventy billion. Four hundred and five billion. The "B" in 7B, 70B, and 405B is doing two jobs at once: it's a rough proxy for what the model can do, and it's a close determinant of what it costs to run. Understanding both jobs, and where they diverge, is what separates a confident answer from a bluff.
What a Parameter Actually Is
A parameter is a number. More precisely, it's a floating-point value stored in the model's weight matrices: the numerical residue of everything the model learned during training. When a model processes your prompt, it multiplies the input through layer after layer of these matrices, each multiplication shaped by billions of individual weight values. The output — the next token, the completion, the answer — is what emerges from that computation in aggregate.
Parameters are coefficients of a learned function, not stored facts or explicit rules the model follows. No single parameter "knows" anything. The knowledge, such as it is, lives in the pattern of billions of them together.
That's the version you can say out loud in a CIO conversation without bluffing. The reader who wants more precision should start with the original transformer paper, "Attention Is All You Need" (Vaswani et al., 2017). For a live meeting, this is enough.
• Parameters: Floating-point values stored in a model's weight matrices — the numerical residue of everything the model learned during training. When the model runs, it multiplies your input through billions of these values; the output is what emerges from that computation.
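To make "a parameter is a number" concrete, here is a minimal sketch of one tiny layer in plain Python. The weights, biases, and the function names `dense_layer` and `count_parameters` are hypothetical and hand-picked for illustration. A real model learns its values during training and stacks many far larger layers.

```python
# Toy illustration only: a single dense layer with hand-picked weights.
# Real models hold billions of such values, learned rather than typed in.

def dense_layer(inputs, weights, biases):
    """Multiply an input vector through a weight matrix and add biases."""
    outputs = []
    for row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(total)
    return outputs

def count_parameters(weights, biases):
    """Every stored weight and every bias counts as one parameter."""
    return sum(len(row) for row in weights) + len(biases)

# Hypothetical 3-input, 2-output layer.
WEIGHTS = [[0.5, -1.0, 2.0],
           [1.5, 0.0, -0.5]]
BIASES = [0.25, -0.25]

result = dense_layer([1.0, 2.0, 3.0], WEIGHTS, BIASES)
n_params = count_parameters(WEIGHTS, BIASES)
print(result, n_params)
```

This layer has 8 parameters. None of them means anything alone; the output depends on all of them acting together, which is the same point at a scale of billions.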
What the Numbers Mean for Hardware
The math connecting parameter count to hardware requirements is almost embarrassingly direct. Each parameter stored in standard 16-bit floating-point format takes 2 bytes of memory. Multiply the parameter count in billions by 2 and you have the approximate VRAM floor, in gigabytes, just to load the model.
- 7B model: roughly 14GB of VRAM. A single high-end consumer GPU handles this.
- 70B model: roughly 140GB of VRAM. That's two NVIDIA H100s at 80GB each, running in tandem, before you've processed a single token.
- 405B model: roughly 810GB of VRAM. You're in cluster territory. This is not a workstation conversation.
These are floors. Running inference (actually generating output) requires additional memory for the context window, attention computations, and output buffers. Real-world VRAM requirements typically run at least 10–20% higher than the load-the-model figure, and more with long context windows or many concurrent requests. But the load figure is what determines whether the hardware conversation is even possible.
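The arithmetic above can be sketched in a few lines of Python. This is a back-of-the-envelope helper, not a sizing tool: the function names are hypothetical, gigabytes are decimal (1 GB = 1e9 bytes) to match the round figures in the text, and the overhead range is simply the 10–20% quoted above.

```python
BYTES_PER_GB = 1e9  # decimal gigabytes, matching the round figures in the text

def weight_memory_gb(params_billions, bits_per_param=16):
    """Approximate memory needed just to hold the weights, in GB."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / BYTES_PER_GB

def inference_range_gb(params_billions, bits_per_param=16, overhead=(0.10, 0.20)):
    """Load floor plus the assumed 10-20% working overhead."""
    floor = weight_memory_gb(params_billions, bits_per_param)
    return floor * (1 + overhead[0]), floor * (1 + overhead[1])

for size in (7, 70, 405):
    floor = weight_memory_gb(size)
    low, high = inference_range_gb(size)
    print(f"{size}B: load floor ~{floor:.0f} GB, with overhead ~{low:.0f}-{high:.0f} GB")
```

The load floors come out at 14, 140, and 810 GB, the same figures as the list above. The overhead range is the soft part of the estimate and should be treated as a starting assumption.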
The capability relationship is real but nonlinear, and the nonlinearity is worth understanding before the quantization section makes it concrete. Larger models generally handle more complex reasoning, longer context, and more nuanced instruction-following than smaller ones. A 70B model will outperform a 7B on most tasks requiring sustained logical coherence or domain depth. A 405B model extends that further. The pattern holds directionally. It does not hold proportionally, and that distinction matters more than it might seem.
• Parameter count: The total number of weight values in a model, measured in billions, which directly determines how much GPU memory the model requires to run. A 70B model at standard precision needs roughly 140GB of VRAM — a hardware floor, not a suggestion.
Quantization: Fitting the Model on the Hardware You Have
The local-LLM community solved a real problem here, and the solution changes the procurement math significantly.
Quantization reduces the bit-width of each stored parameter. Instead of 16 bits per weight, you store 8 bits (INT8) or 4 bits (INT4). The memory reduction is proportional:
- 16-bit to 8-bit: roughly 2x reduction in VRAM
- 16-bit to 4-bit: roughly 4x reduction in VRAM
Applied to a 70B model: unquantized 16-bit weights require ~140GB. Quantized to 8-bit, that drops to ~70GB. Quantized to 4-bit, it drops to ~35GB. A single H100 80GB can now hold a 4-bit 70B model that previously required two cards, with room left over for inference overhead. A pair of consumer-grade GPUs with 24GB each can run it in a configuration that would have been implausible at 16-bit.
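The 70B figures can be reproduced with a short sketch. The helper names are hypothetical, and the GPU count covers the weights only: it ignores inference overhead and assumes the model splits cleanly across identical cards, which real deployments do not guarantee.

```python
import math

def weight_memory_gb(params_billions, bits_per_param):
    """Weights-only footprint in decimal GB (1 billion params at 8 bits = 1 GB)."""
    return params_billions * bits_per_param / 8

def min_gpus(required_gb, gpu_memory_gb):
    """Smallest number of identical GPUs whose combined memory covers the weights."""
    return math.ceil(required_gb / gpu_memory_gb)

for bits in (16, 8, 4):
    need = weight_memory_gb(70, bits)
    print(f"70B at {bits}-bit: ~{need:.0f} GB, "
          f"{min_gpus(need, 80)} x 80GB GPU(s) or {min_gpus(need, 24)} x 24GB GPU(s)")
```

At 8-bit the weights alone take about 70GB of an 80GB card, which leaves little headroom. That is one reason 4-bit is the usual target for single-GPU deployments of a model this size.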
The quality cost is real but smaller than most people expect. For standard language tasks (summarization, question answering, instruction following, code generation), 4-bit quantization typically costs 1–3 percentage points on benchmark evaluations. For tasks requiring precise numerical computation or very fine-grained reasoning chains, the loss is more pronounced. Published evaluations from groups like EleutherAI and Hugging Face's Open LLM Leaderboard show this pattern consistently, though exact numbers vary by model architecture and quantization method.
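Where the quality cost comes from is easier to see in a toy example. The sketch below uses plain symmetric round-to-nearest quantization with one scale for the whole list. That is an assumption chosen for simplicity: production quantization methods are more sophisticated, and the sample weights are invented.

```python
def quantize(weights, bits):
    """Map floats to signed integer codes in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    max_level = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    scale = (max(abs(w) for w in weights) / max_level) or 1.0  # guard all-zero input
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

def max_error(weights, bits):
    """Largest gap between an original weight and its round-tripped value."""
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))

# Hypothetical weight values, chosen only for illustration.
SAMPLE = [0.82, -0.31, 0.07, -0.64, 0.45, -0.93, 0.12, 0.58]

err8 = max_error(SAMPLE, 8)
err4 = max_error(SAMPLE, 4)
print(f"INT8 max round-trip error: {err8:.4f}")
print(f"INT4 max round-trip error: {err4:.4f}")
```

Fewer bits mean fewer available levels, so each weight lands further from its original value. The benchmark loss described above is the accumulated effect of those small errors across billions of weights.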
The result worth holding onto: the quality loss from quantizing a large model is usually smaller than the capability gap between a large model and a small one. A 70B model quantized to 4-bit frequently outperforms a 7B model running at full 16-bit precision on the same task. You gave up some precision on the large model and still came out ahead.
This isn't a universal law. It depends on the task, the model family, and the quantization implementation. But it holds often enough that "we can't afford the hardware for the big model, so we'll use the small one" is not always the right trade. Sometimes the right answer is "we'll quantize the big model and run it on the hardware we have."
• Quantization: A compression technique that reduces the bit-width of each parameter, shrinking the model's memory footprint by 2x to 4x with surprisingly small quality loss for most tasks. A 70B model quantized to 4-bit needs roughly 35GB, a quarter of its 16-bit footprint, and often outperforms a 7B model running at 16-bit.
When This Conversation Happens
The scenario where this matters most in a federal or enterprise context is the air-gapped deployment question. An agency wants to run AI locally — no cloud, no external API calls, data never leaves the enclave. The first question is capability: what model is good enough for this use case? The second question is immediate: what hardware do you have?
If the answer is a single A100 80GB, a quantized 70B at 4-bit is feasible. An unquantized 16-bit 70B isn't. If the answer is "we have four A100 80GB cards in a rack," that is 320GB of pooled memory: a 16-bit 70B becomes possible, and a 405B quantized to 4-bit (roughly 200GB) enters the conversation. The parameter count is the number that turns a capability discussion into a procurement discussion, and understanding the quantization lever means you can have that conversation without immediately hitting a wall.
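The same reasoning can be turned into a quick feasibility check for a hardware inventory. This is a rough sketch under stated assumptions: the `fits` helper is hypothetical, the 20% overhead is the upper end of the range quoted earlier, and memory is treated as one pool across cards, which ignores the practical cost of splitting a model over several GPUs.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Weights-only footprint in decimal GB."""
    return params_billions * bits_per_param / 8

def fits(params_billions, bits_per_param, gpu_count, gpu_memory_gb, overhead=0.20):
    """True if weights plus an assumed working overhead fit in pooled GPU memory."""
    required = weight_memory_gb(params_billions, bits_per_param) * (1 + overhead)
    return required <= gpu_count * gpu_memory_gb

# (label, number of GPUs, memory per GPU in GB)
SCENARIOS = [
    ("1 x A100 80GB", 1, 80),
    ("4 x A100 80GB", 4, 80),
]
# (parameters in billions, bits per parameter)
CANDIDATES = [(70, 16), (70, 4), (405, 8), (405, 4)]

for label, count, memory in SCENARIOS:
    for params, bits in CANDIDATES:
        verdict = "fits" if fits(params, bits, count, memory) else "does not fit"
        print(f"{label}: {params}B at {bits}-bit {verdict}")
```

On a single 80GB card only the 4-bit 70B passes. With four cards, the 16-bit 70B and the 4-bit 405B pass, while an 8-bit 405B still does not.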
Buyers will ask some version of: "The vendor's demo ran fine, why does our deployment cost so much more?" The answer is very often parameter count. Demos tend to run small models. Production requirements, especially for complex government use cases, often require large ones. The gap between 7B and 70B is not a 10x cost increase in every dimension, but it is a 10x VRAM increase, which means the difference between a single GPU and a multi-GPU configuration with all the orchestration complexity that implies.
• Hardware floor: The minimum GPU memory required to load and run a model, set almost entirely by parameter count and precision. This is the number that turns a capability conversation into a procurement conversation.
Okta Concept Mapping
The Policy Engine Bridge — and Where It Breaks
The closest IDAM analogy for parameter count is policy engine rule count. More rules mean more complexity, more compute to evaluate, and a higher operational cost to maintain — the same directional relationship that holds between parameter count and hardware requirements. If you've ever scoped a policy engine deployment and had to account for evaluation latency as rule sets grew, you have the right intuition for why a 405B model costs more to run than a 7B.
The analogy breaks at capability scaling. Policy engine performance tracks rule count and rule quality in a legible way — add better rules, get better outcomes. Model capability does not scale linearly with parameter count. A quantized 70B model frequently outperforms a full-precision 7B on the same task. There is no policy engine equivalent: you cannot compress a 70-rule policy to 17 rules and have it outperform a 7-rule policy. The compression-capability relationship in language models is fundamentally different, and a buyer who assumes "smaller model equals proportionally less capable" will make bad procurement decisions because of it.
The One Thing to Hold Onto
Parameter count sets the hardware floor. Quantization moves that floor down, sometimes dramatically, with quality loss that is real but often smaller than the capability gap between model sizes. A compressed large model frequently beats an uncompressed small one.
When a buyer asks why their AI deployment costs what it costs, the answer starts with the B. When they ask whether they can run it on the hardware they have, the answer starts with quantization. These are the two levers, and everything else in the model-size conversation follows from understanding them.

