Every large language model gets described by a single number: 7B, 70B, 405B. That number is doing two jobs simultaneously, and most conversations about AI models treat them as separate when they're not. The parameter count tells you roughly how capable the model is, and it tells you the minimum hardware required to run it. You cannot have one without the other. The same property that makes a model smarter makes it heavier.
If you're going into a conversation with a CIO who wants to know why their agency can't just "run the big model locally," this is the concept you need to hold.
What Parameters Actually Are
A parameter is a number. Specifically, it's a floating-point value — a decimal — that gets stored in memory and used in arithmetic during every inference pass. When a model "generates" text, what's actually happening is a cascade of matrix multiplications: the input gets transformed repeatedly through layers of these weights, and the output emerges from the final transformation.
The model doesn't "know" things the way a database knows things. There's no lookup table of facts. The knowledge is distributed across billions of these numerical weights, encoded in their specific values after training on enormous text corpora. Change the values, and you change what the model knows and how it reasons. The weights are the model. Everything else — the architecture, the tokenizer, the inference runtime — is scaffolding.
A 70B model has 70 billion of these numbers. A 405B model has 405 billion. More parameters means more capacity to encode relationships, more nuance in reasoning, higher ceiling on complex tasks. It also means more numbers to store, more arithmetic to run, and more memory required to hold the whole thing in a state where inference is possible.
- Parameters: The numerical weights that encode a model's learned knowledge — 70B means 70 billion floating-point values that must be loaded into memory and used in computation during every inference pass.
The Arithmetic You Need to Follow the Conversation
The hardware floor comes directly from the math, and the math is simple enough to do in your head once you have the pieces.
Each parameter is stored as a floating-point number. The standard precision for model inference is 16-bit floating point, which takes 2 bytes of memory per value. That's the baseline — not a compressed format, just the ordinary way models get loaded for use.
Take a 70B model at 16-bit precision:
70,000,000,000 parameters × 2 bytes = 140,000,000,000 bytes = ~140 GB of VRAM
VRAM, not RAM. The model has to live in GPU memory during inference because CPUs are too slow for the matrix math at any useful throughput. RAM doesn't substitute.
Now translate that into physical hardware. A high-end data center GPU — an NVIDIA H100 or A100 — carries 80GB of VRAM. One card holds 80GB. A 70B model at full 16-bit precision needs 140GB, so you need at least two of those cards working together, with the model split across them. Consumer GPUs top out around 24GB (the RTX 4090 as of this writing). Running a 70B model in full precision on consumer hardware means six of those cards, minimum, with the overhead of coordinating them.
This is why "just run the big model" is not a procurement decision — it's a data center decision.
A 405B model at the same precision requires roughly 810GB of VRAM. That's ten H100s. The number after the B isn't a marketing figure; it's a hardware specification.
Quantization changes this arithmetic.
Quantization is the technique of reducing the numerical precision of each stored weight. Instead of 16 bits per parameter, you store 8 bits (INT8) or 4 bits (INT4). The math:
- 8-bit: 1 byte per parameter → 70B model = ~70 GB (fits on one H100, or three consumer cards)
- 4-bit: 0.5 bytes per parameter → 70B model = ~35 GB (fits on two consumer cards)
That's how the local-LLM community runs 70B models on a gaming PC. Not magic — arithmetic.
The obvious question is what you lose. The honest answer: less than you'd expect for most tasks, more than you'd want for some. Quantization introduces small rounding errors in each weight. For tasks like summarization, question-answering, and code generation, published evaluations consistently show 4-bit quantized models performing within a few percentage points of their full-precision counterparts on standard benchmarks. For tasks requiring precise numerical reasoning or very long chains of logical inference, the degradation is more noticeable. The quality loss is real; it's just not catastrophic for the majority of enterprise use cases.
The technique is called post-training quantization, and the name is worth knowing. The model is trained at full precision, then the weights are compressed afterward. The compression is lossy, but neural networks are surprisingly tolerant of small approximation errors because no single weight is load-bearing. The knowledge is distributed.
- Hardware floor: A 70B model at standard 16-bit precision requires ~140GB of VRAM — roughly two high-end data center GPUs. Quantization to 4-bit reduces this to ~35GB, enabling the same model to run on consumer-grade hardware with limited quality loss for most tasks.
- Quantization: A compression technique that reduces per-parameter storage from 2 bytes (16-bit) to 1 byte (8-bit) or 0.5 bytes (4-bit), cutting VRAM requirements proportionally at the cost of minor precision loss.
IDAM Concept Mapping: Key Length and Its Limits
The closest analog in your existing toolkit is cryptographic key length. A 256-bit AES key is more secure than a 128-bit key, and it also requires more compute to process — bigger means more capable and more expensive, simultaneously. Parameter count works the same way: the number that sets the capability ceiling is the same number that sets the hardware floor. You cannot optimize one without affecting the other.
Here's where the analogy stops bearing weight. In cryptography, you cannot lossy-compress a private key. The math requires exact values — a key with any bits altered is simply wrong, and the operation fails. Neural network weights are fundamentally different. They're not doing exact computation; they're doing weighted approximation across billions of values, and that process is robust to small rounding errors. Quantization works precisely because no individual weight is load-bearing. Applying the same intuition to model weights that you'd apply to a private key would lead you to expect quantization to destroy model quality. It doesn't, for most tasks, and understanding why requires letting go of the cryptographic frame entirely.
When This Comes Up in an Actual Conversation
On-premises conversations in public sector accounts are where this lands hardest. An agency wants to run a capable model inside their boundary — air-gapped environment, data sovereignty requirement, FedRAMP authorization that doesn't extend to a cloud inference endpoint. What hardware do we need?
That question starts with parameter count. If the agency wants a model in the 70B range — capable enough for complex document analysis, policy summarization, or code review — they need to know whether they're buying for full precision or quantized. Full precision means data center hardware: multiple high-end GPUs, the power and cooling infrastructure to support them, the operational overhead of running a small GPU cluster. Quantized means the hardware floor drops significantly, potentially to a single server with one or two high-end cards.
Neither answer is wrong. The tradeoff is hardware cost and operational complexity against the quality degradation from quantization — which, again, is limited for most tasks but real for some. The actual question is what the model will be doing. Summarizing meeting transcripts and drafting correspondence? 4-bit quantization is probably fine. Running complex multi-step reasoning over classified technical documents where precision matters? Worth understanding the benchmark data for the specific model and task type before committing to a quantized deployment.
Vendor conversations are the other place this surfaces. When a vendor tells you their model "runs on a single GPU," the first question is: which model, at what precision, on which GPU? A 7B model at 4-bit quantization runs on a laptop. A 70B model at 4-bit runs on a workstation. A 70B model at full precision does not run on a single consumer GPU under any circumstances. The number after the B, combined with the precision, tells you whether the claim is plausible before you've seen a demo.
You don't need to run the arithmetic in the meeting. You need to know that the arithmetic exists, that it's simple, and that anyone making hardware claims about model deployment should be able to show it to you.
- The practical question: When a buyer asks about on-premises deployment, parameter count and quantization level together determine the hardware specification. A 70B model at 4-bit quantization needs roughly 35GB of VRAM; the same model at full 16-bit precision needs 140GB. These are different procurement conversations.

