The Formula That Sets the Hardware Floor

Parameter count times bytes per parameter sets your hardware floor. Quantization cuts that floor dramatically, with less quality loss than your deterministic instincts expect.

By Leigh Garrity— May 9, 2026

The Formula That Sets the Hardware Floor

Parameter count times bytes per parameter sets your hardware floor. Quantization cuts that floor dramatically, with less quality loss than your deterministic instincts expect.

Every model ships with a number. Llama 3.1 70B. Mistral 7B. DeepSeek R1 671B. That number is the parameter count, and it's doing more work in conversations than anyone stops to explain. It tells you roughly how capable the model is, roughly how much it costs to run, and exactly how much memory the machine needs before the model will even load.

Whether a model physically fits on a given machine comes down to arithmetic. One formula:

The Hardware Floor

Bytes per parameter × parameter count = memory required. Everything in this piece is context for those three terms.

What a parameter actually is

A parameter is a single numerical weight inside the model. During training, the model adjusts billions of these weights until they collectively encode patterns in language. Once training finishes, the weights are fixed. Inference (the part where the model answers your question) is matrix multiplication across all of them, every time.

A 70-billion-parameter model has 70 billion of these numbers. Each one must be stored in memory and read during every forward pass. The question is how many bytes each number occupies.

Bytes per parameter

In standard full precision (FP16, 16-bit floating point), each parameter takes 2 bytes. The math:

7B model at FP16: 7 billion × 2 bytes = 14 GB
70B model at FP16: 70 billion × 2 bytes = 140 GB
405B model at FP16: 405 billion × 2 bytes = 810 GB

A high-end consumer GPU like the NVIDIA RTX 4090 has 24 GB of VRAM. A single H100 datacenter GPU has 80 GB. So a 70B model at full precision doesn't fit on one H100. It doesn't fit on three RTX 4090s. You need multiple datacenter GPUs just to load it into memory, before it processes a single token.

Quantization exists to solve this.

Quantization compresses the first term

Quantization reduces bytes per parameter by storing weights at lower numerical precision. Instead of 16-bit floats (2 bytes each), you store them as 8-bit integers (1 byte) or 4-bit integers (0.5 bytes). For a 70B model:

Precision	Bytes per param	70B model memory	Hardware implication
FP16 (full)	2.0	~140 GB	Multiple datacenter GPUs
INT8 (8-bit)	1.0	~70 GB	One high-end datacenter GPU
INT4 (4-bit)	0.5	~35 GB	Two consumer GPUs or one workstation card

Same model, same 70 billion parameters, one-quarter the memory.

The obvious question: what breaks?

What quantization costs

Remarkably little, given what you're doing to the numbers.

If your background is in deterministic systems, reducing numerical precision sounds reckless. And in deterministic systems, it would be. But LLM inference is probabilistic. The model generates probability distributions over tokens. It was never computing exact answers. The tolerance for imprecision is fundamentally different.

Red Hat published evaluations across over half a million benchmark runs on quantized models, covering standard academic benchmarks, coding tasks (HumanEval), and conversational evaluation (Arena-Hard). Their findings at both precision levels:

8-bit quantization is essentially lossless. Across every model and benchmark Red Hat tested, INT8 models held above 99% accuracy recovery. If memory budget allows double the 4-bit cost, 8-bit is the safe default with no meaningful quality discussion to have.

4-bit is where it gets interesting. Quantized models at INT4 still consistently achieved over 99% accuracy recovery compared to full-precision baselines. The one documented outlier was a specific 8B model at INT W4A16, which dropped to 97.33%. Everything else held above 99%.

Their long-context evaluation using NVIDIA's RULER benchmark showed similar results: 99.2% accuracy recovery for the 8B model and 98.6% for the 70B model across sequence lengths from 4K to 64K tokens. Quality only degraded meaningfully at 128K tokens, where even unquantized models were already struggling.

These are published numbers on standard benchmarks from a documented evaluation methodology. The quality retention is genuinely surprising if you're calibrated to systems where precision is binary.

The practical default

If you've encountered anyone running models locally through Ollama or llama.cpp, they're almost certainly using a format called Q4_K_M. It's worth knowing what that means: primarily 4-bit precision, but using a smarter quantization method (the "K" family) that applies mixed precision across layers, quantizing sensitive components like attention projections less aggressively than the rest.

The result is approximately 4.5 bits per weight on average, not exactly 4.0. That extra half-bit of metadata per weight buys real quality. The llama.cpp project's own perplexity measurements show Q4_K_M adding only +0.0535 perplexity on a 7B model, compared to +0.2499 for naive 4-bit quantization (Q4_0). Roughly 5× less quality degradation for the same approximate bit-width. Below Q4_K_M, quality falls off a cliff: Q2_K shows +0.8698 perplexity, which is 16× worse. There's a reason Q4_K_M is the recommended default.

In practical terms: a 70B model at Q4_K_M fits in about 40 GB. A 7B model fits in about 4 GB.

The spillover penalty

When a model exceeds available VRAM, the system doesn't refuse to load it. It spills layers to system RAM and continues. The term for this is graceful degradation. The experience is a cliff.

Hardware	Memory type	Bandwidth	Spillover penalty
RTX 4090 (consumer)	GDDR6X VRAM	~1,008 GB/s	—
System RAM	Dual-channel DDR5	~100 GB/s	10× slower
H100 SXM (datacenter)	HBM3	~3,350 GB/s	—
System RAM (on H100 host)	DDR5	~100 GB/s	33× slower

Inference speed drops from 25+ tokens per second to low single digits. The model still produces output. It just produces it at a speed that makes interactive use painful and multi-user serving impossible. One misfit layer can tank throughput for an entire deployment.

"How much VRAM?" is the first question in any on-premises model deployment conversation for exactly this reason. Below the floor, the model is 10× slower. In a serving context, that's functionally broken.

The MoE complication

One wrinkle worth knowing: Mixture of Experts (MoE) models break the intuitive link between parameter count and compute cost. They do not break the link between parameter count and memory.

DeepSeek R1 has 671 billion total parameters but only activates 37 billion per token. A routing mechanism picks which expert subnetworks to use for each input. Computationally efficient, yes. All 671 billion parameters still need to reside in memory, though, because the router needs to be able to reach any expert on any forward pass. At FP8 precision, that's approximately 700 GB for the weights alone. You need an 8×H200 pod (1,152 GB of VRAM) to hold it in one node.

When someone tells you a model has 671B parameters but "only uses 37B," the memory bill is still 671B. Active parameters set the speed. Total parameters set the floor.

Where your IDAM intuition applies and where it misleads

The deployment constraint maps cleanly. An IdP that can't handle authentication load doesn't serve 80% of requests correctly. It queues, it times out, it fails. The VRAM floor works the same way. The model either fits or it doesn't, and the consequences of not fitting aren't graceful.

Here's where the analogy stops bearing weight. In identity, you would never ship a credential that's 98% accurate. A SAML assertion that's wrong 2% of the time is a security incident. But in LLM inference, 99% quality retention at 4-bit quantization is genuinely fine for most tasks. The system is probabilistic by design. Its outputs were never exact. Shaving precision off the weights introduces noise into a process that was already noisy, and the benchmarks show that noise is negligible for standard workloads.

Keep that distinction in your back pocket. When a buyer's technical team is evaluating whether to run a quantized model on-premises versus paying for full-precision API access, they're looking at 99% quality at one-quarter the memory cost versus 100% quality at four times the infrastructure. That's a procurement decision. And procurement decisions are a language you already speak.

Things to follow up on...

MoE memory versus compute: DeepSeek R1's 671B parameters require roughly 700 GB just for weights at FP8, but the KV cache for long contexts can exceed the model's own memory footprint, adding another dimension to the VRAM floor calculation.
Red Hat's long-context edge case: Their RULER benchmark evaluation found quantized models held above 99% accuracy recovery up to 64K tokens, but INT4 accuracy dropped to 85–88% at 128K sequence length, a threshold worth watching as agent workloads push context windows longer.
Q4_K_M is not exactly 4-bit: A January 2026 arXiv preprint studying GGUF quantization formats found that Q4_K_M's hierarchical super-block structure produces approximately 4.5 effective bits per weight, and that extra metadata is what separates it from naive INT4 in quality benchmarks.
The H200 memory jump: NVIDIA's H200 uses the same die as the H100 but upgrades to 141 GB of HBM3e at 4,800 GB/s bandwidth, which is why it's become the minimum viable single-node option for 671B-class models in production inference.