Recap: The Physical Reality of AI

By Leigh Garrity— May 9, 2026

You've just read through what a model physically is, what it needs to run, who controls it, what it costs, and how you modify it. This document doesn't re-teach any of that. It organizes it so you can find it again in thirty seconds before a call.

Every entry follows the same shape: what the term means, when it surfaces in a sales conversation, and what it gets confused with. The tables cover the vocabulary collision zones.

The File and Its Size

Parameters — The numerical values that encode everything the model learned during training. A 70B model has 70 billion of them. The file is these numbers. When it comes up: When a customer asks what they're actually downloading or storing when they "get" a model. Don't confuse with: Weights. In practice, used interchangeably. The technical distinction — weights are the parameters updated during training — rarely surfaces until a customer's ML team is in the room.

Quantization — Reducing the numerical precision of each parameter to shrink the file. A 70B model at full precision (fp16) runs roughly 140GB. Quantized to 4-bit, it drops to around 35GB. Same model, smaller file, slightly degraded output quality. When it comes up: When a customer asks whether they can run an open-weights model on their existing infrastructure. Don't confuse with: Compression in the traditional sense. Quantization is lossy in a probabilistic way — outputs shift, they don't corrupt.

Context window — The maximum text the model can process at once, measured in tokens. Runtime working memory. A 128K context window holds roughly 100,000 words. When it comes up: When a customer asks whether the model can handle long documents or full conversation histories. Don't confuse with: Training data. The context window is what the model sees right now; training data is what shaped the weights months ago.

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Parameters	All numerical values defining model behavior	Policy store or attribute database	You can't query individual rules out of a model the way you can query a policy store
Weights	Parameters updated during training — in practice, synonymous with parameters	Configuration values in an auth system	The distinction matters to ML engineers; it almost never surfaces in procurement
Quantization	Reducing numerical precision to shrink file size, trading some output quality	Schema compression or index optimization	Probabilistic degradation, not lossless compression — outputs shift, not just storage footprint

If you remember nothing else: A model is a file. File size determines hardware requirements, which determines where it can run, which determines who controls it. Every architecture conversation starts here.

The Hardware It Needs

GPU — A processor optimized for the parallel math that model inference requires. Running a large model on a CPU is technically possible and practically useless for anything latency-sensitive. When it comes up: Any conversation about on-premises deployment or "running it ourselves." Don't confuse with: General-purpose compute. The relevant spec for AI workloads is VRAM — the memory on the GPU — not core count.

VRAM — The GPU's onboard memory. The model's weights must fit in VRAM to run efficiently. A 35GB quantized model needs at least 40GB of VRAM — meaning an A100 80GB, or two A100 40GBs. When it comes up: Any self-hosted or on-premises deployment conversation. Don't confuse with: System RAM. A server with 512GB of RAM and 16GB of VRAM cannot run a 35GB model at useful speed. The weights need to live on the GPU, not in system memory.

Inference vs. training hardware — Inference (running the model) and training (building it from scratch) have different hardware profiles. Inference needs fast memory bandwidth. Training needs sustained compute across days or weeks, at a cost that starts in the tens of thousands of dollars. When it comes up: When a customer conflates "we want to use AI" with "we want to train AI." They almost certainly mean inference. Don't confuse with: Fine-tuning hardware requirements. Fine-tuning sits between inference and full training — more than inference, far less than training from scratch.

If you remember nothing else: VRAM is the hard constraint. Everything else is negotiable.

Who Controls It and Where

API-hosted model — A model you access over an API, running on the provider's infrastructure. You send text in, you get text back. You never touch the weights. When it comes up: OpenAI, Anthropic, Google Gemini — this is the default consumption model for most enterprise buyers starting out. Don't confuse with: Self-hosted. With API-hosted, the provider controls the model version, infrastructure, data handling, and uptime SLA. You control none of those things.

Self-hosted — Running the model on infrastructure you control, using open-weights models. You own the hardware, the weights, and the data that flows through it. When it comes up: Federal customers with data sovereignty requirements, or any customer asking "can we keep this inside our boundary?" Don't confuse with: Private cloud. "Self-hosted on AWS GovCloud" is not the same as "self-hosted on our own hardware" — the distinction matters for authorization scope and data residency conversations.

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Open (model)	Marketing term — means different things from different vendors; usually means open-weights	"Open standards" in IDAM — a claim, not a specification	Treat as the start of a question, not an answer
Open-weights	Weights are publicly available; training code and data may not be	Compiled binary of an open-source auth library — you can run it, not audit it	You can inspect what the model does; you cannot fully audit how it became that way
Open-source	Weights, training code, and ideally training data are all public	Full source release of an auth library	Genuinely rare among frontier models; most "open" AI is open-weights only

If you remember nothing else: Open-weights means you can run it. It doesn't mean you can see how it was built or what it was trained on.

What Running It Costs

Token (billing) — The unit of measurement for AI API pricing. Roughly 4 characters of English text. A 1,000-word document is approximately 1,300 tokens. Input tokens (what you send) and output tokens (what you receive) are priced separately — output typically costs 3–5x more, because generating text is computationally heavier than reading it. When it comes up: Any conversation about AI API costs. Don't confuse with: Tokens in IDAM. See table below. This is the term collision most likely to land awkwardly in a mixed vendor conversation.

Context window cost multiplier — Every token in the context window counts as input on every API call. A 2,000-token system prompt, sent with every request, multiplies your input cost accordingly. This is the cost nobody mentions in the keynote. When it comes up: When a customer is modeling AI costs and only counting the user's message, not the full context. Don't confuse with: Storage cost. Context window cost is per-call. The same 2,000-token system prompt costs money every single time it's sent.

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Token (AI/billing)	A ~4-character text fragment; the unit for measuring LLM input and output	API call or transaction unit	Token count is content-dependent — a longer message costs more than a shorter one; there's no flat per-call equivalent
Token (IDAM)	A credential artifact — JWT, SAML assertion, OAuth bearer token — conveying identity or authorization	Same	Zero semantic overlap; the word collision is real and happens in front of buyers

If you remember nothing else: The cost on the pricing page is the input/output cost. The cost you don't see is the context window multiplied by every call.

How You Change It

Prompting — Changing the model's output by changing the input. No modification to the weights. Zero infrastructure cost beyond normal inference. When it comes up: When a customer asks how to make the model behave differently for their use case. This is almost always the right first answer. Don't confuse with: Fine-tuning. Prompting changes what you ask. Fine-tuning changes the model itself.

RAG (Retrieval-Augmented Generation) — Connecting the model to an external knowledge source at inference time, rather than baking knowledge into the weights. The model retrieves relevant documents and uses them as context for its response. When it comes up: When a customer wants the model to answer questions about internal documents. RAG keeps the data outside the model — a real distinction for data governance conversations. Don't confuse with: Fine-tuning. RAG doesn't touch the weights. The knowledge lives in a separate store you control, update, and audit independently.

Fine-tuning — Continuing the training process on a smaller, task-specific dataset to adjust the model's weights. Requires GPU compute, a curated dataset, and a data governance conversation about what's in that dataset. When it comes up: When a customer says they want the model to "know our stuff" or "sound like us." Ask whether RAG would serve the same goal before agreeing that fine-tuning is the answer. Don't confuse with: Training from scratch. Fine-tuning starts from existing weights. Training from scratch starts from random initialization and costs orders of magnitude more.

If you remember nothing else: Prompting changes behavior. Fine-tuning changes the weights. Only one of those requires a GPU cluster and a data governance conversation.

For More Information

The entries above trace back to five source articles in this section. Each is annotated for what it's useful for — not what it covers.

"What a Model Actually Is" Useful for: Explaining to a customer what they're buying, storing, or deploying when they talk about "a model." The parameters/weights/quantization vocabulary lives here.

"The Hardware Layer" Useful for: Any conversation about on-premises or self-hosted deployment. The VRAM constraint and the inference/training distinction are covered in detail here.

"Hosted vs. Self-Hosted: The Deployment Decision" Useful for: Federal and regulated-industry customers asking about data sovereignty and boundary control. The open/open-weights/open-source distinctions are developed here.

"The Real Cost of Inference" Useful for: Cost modeling conversations and any situation where a customer's AI budget assumptions look too clean. The context window multiplier and the input/output pricing asymmetry are the key takeaways.

"Modifying a Model: From Prompting to Retraining" Useful for: Customers who want to customize model behavior. Covers the full spectrum from zero-cost prompting through fine-tuning to training from scratch, with RAG as the data-governance-friendly path that most buyers overlook.

Production note: Model sizes, VRAM figures, token pricing ratios, and article titles are illustrative. Verify against current provider pricing pages and published model cards before using in customer-facing materials.