The Physical Reality of AI: Section Recap

By Leigh Garrity, May 8, 2026

You've read the articles. The structure below is what makes them stick. Pull this up before a call with a CAIO or an infrastructure lead. The vocabulary tables are in the middle; the source index tells you where to go deeper if the conversation goes somewhere unexpected.

Mental Model 1: The File and Its Hardware Requirements

If you remember nothing else: A model is a file. The GPU is the only thing that can read it fast enough to matter.

Model weights — The actual file containing a trained model: billions of numerical values that encode everything the model learned. A 70-billion-parameter model at 16-bit precision (2 bytes per parameter) runs about 140GB on disk. It is a very large set of floating-point matrices, not organized like a database and not installed like an application. When it comes up: When a customer says they want to "run the model on-prem," this is what they're proposing to store, move, and serve. Ask who owns the storage and what their read throughput looks like. Don't confuse with: Model size in terms of capability. Parameter count correlates with capability, but quantization can cut the file size to roughly half (INT8) or a quarter (INT4) with modest quality tradeoffs. A 70B model quantized to INT4 fits in about 35GB. The file got smaller; the model is still 70B parameters.
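
The disk-size figures above are just parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming decimal gigabytes and ignoring file-format overhead (the function name is illustrative, not from the source articles):

```python
# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_file_gb(param_count: float, precision: str) -> float:
    """Approximate weight file size in decimal GB (1 GB = 1e9 bytes)."""
    return param_count * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in ("fp16", "int8", "int4"):
        size = weight_file_gb(70e9, precision)
        print(f"70B parameters at {precision}: about {size:.0f} GB")
```

For a 70B model this gives about 140GB at 16-bit, 70GB at INT8, and 35GB at INT4. The parameter count never changes; only the bytes spent on each parameter do.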

GPU VRAM — The dedicated high-bandwidth memory on the GPU that holds the model during inference. The entire model must fit in VRAM to run at production latency. System RAM doesn't substitute: it's roughly an order of magnitude too slow. When it comes up: Any on-prem or private cloud deployment conversation. At 16-bit precision, two A100 80GB GPUs can serve a 70B model. One cannot. That's a hardware procurement decision, not a configuration option. Don't confuse with: Total server RAM. A machine with 512GB of system RAM and a single 24GB GPU cannot serve a 70B model at usable speed. VRAM is the binding constraint.
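
A rough way to sanity-check a hardware proposal is to divide the weight size by usable VRAM per GPU. The sketch below is a lower bound only. The 10% headroom for KV cache and activations is an assumption for illustration, and real deployments often need more.

```python
import math


def gpus_needed(weights_gb: float, vram_per_gpu_gb: float,
                headroom: float = 0.10) -> int:
    """Minimum GPU count so the weights fit in combined usable VRAM.

    `headroom` is an assumed fraction of each GPU's VRAM reserved for
    KV cache and activations; it is a placeholder, not a measured value.
    """
    usable_gb = vram_per_gpu_gb * (1.0 - headroom)
    return math.ceil(weights_gb / usable_gb)


if __name__ == "__main__":
    print(gpus_needed(140, 80))  # 70B at 16-bit on 80GB cards
    print(gpus_needed(140, 24))  # same model on 24GB cards
```

Under these assumptions, 140GB of weights needs two 80GB cards or seven 24GB cards. System RAM does not appear anywhere in the calculation, which is the point.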

Quantization — Compressing model weights from 32-bit or 16-bit floats to lower-precision integers (INT8, INT4) to reduce VRAM requirements and increase throughput. When it comes up: When a customer is trying to run a large model on hardware they already own. The tradeoff is real but often acceptable: INT4 quantization of a 70B model typically degrades benchmark performance by 2–5%, which is invisible in most enterprise use cases. Don't confuse with: Fine-tuning. Quantization changes the precision of existing weights. Fine-tuning changes the weights themselves. Different operation, different cost, different risk profile.
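
To make "changing the precision of existing weights" concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is a teaching example, not how any particular inference stack does it. Production schemes typically quantize per channel or per group and use more careful calibration.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the INT8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    original = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(original)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(original, restored))
    print(f"stored integers: {q}, worst rounding error: {worst:.5f}")
```

Each weight is stored in one byte instead of two or four, and the rounding error per weight is bounded by half the scale. That small, bounded error is where the quality tradeoff comes from. Nothing is retrained, which is why this is a different operation from fine-tuning.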

Mental Model 2: The Location and Its Consequences

If you remember nothing else: Where the model runs is where the data goes. That's a data residency decision, not an IT decision.

Model endpoint — The URL or API surface that receives inference requests and returns completions. Every prompt your customer's application sends travels to this address. Every response comes back from it. When it comes up: In any conversation about data handling, logging, or breach surface. The endpoint owner controls what gets logged, retained, and potentially used for retraining. Federal data classification conversations start here. Don't confuse with: The application layer. The endpoint is the model service. The application calling it is a separate system with separate logging, separate access controls, and separate liability.

Hosted vs. self-hosted deployment — Hosted: the model runs on the vendor's infrastructure (OpenAI, Anthropic, Azure OpenAI Service). Self-hosted: the model runs on infrastructure the customer controls, whether on-prem or in their own cloud tenancy. When it comes up: FedRAMP authorization, data residency requirements, IL4/IL5 workloads. FedRAMP-authorized hosted endpoints exist (Azure Government OpenAI is the primary example as of mid-2026), but the authorization boundary matters. "Azure" isn't automatically "FedRAMP." The specific service and region determine the boundary. Don't confuse with: Open-source vs. proprietary models. A customer can self-host a proprietary model (via licensing) or use a hosted open-source model. The hosting decision and the licensing decision are independent.

Data residency — The legal and contractual requirement that data remain within a specified geographic or jurisdictional boundary. For AI workloads, this applies to prompts and completions, not just stored data. When it comes up: Every civilian agency conversation, every defense-adjacent account, and increasingly every state and local government deal. The prompt contains the sensitive data. The model processes it. If the model endpoint is in a non-compliant region, the data left the boundary the moment the request was sent. Don't confuse with: Data sovereignty, which is the broader legal framework. Residency is the operational constraint. Sovereignty is the policy rationale. Your customer's procurement team cares about residency. Their legal team cares about sovereignty. You need to speak both.
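
Because the data leaves the boundary the moment the request is sent, any residency check has to happen before the call. A minimal sketch of such a guard follows. The hostnames, region names, and the lookup table are all hypothetical; a real deployment would get the endpoint-to-region mapping from its cloud provider or contract, not from a hard-coded dictionary.

```python
from urllib.parse import urlparse

# Hypothetical mapping of endpoint hosts to the region where they run.
ENDPOINT_REGIONS = {
    "inference.example.gov": "us-gov-east",
    "api.example.com": "eu-west",
}

# Hypothetical set of regions the customer's policy allows.
ALLOWED_REGIONS = {"us-gov-east", "us-gov-west"}


def residency_ok(endpoint_url: str) -> bool:
    """True only if the endpoint host is known and in an allowed region."""
    host = urlparse(endpoint_url).hostname
    return ENDPOINT_REGIONS.get(host) in ALLOWED_REGIONS


if __name__ == "__main__":
    for url in ("https://inference.example.gov/v1/chat",
                "https://api.example.com/v1/chat"):
        print(url, "->", "send" if residency_ok(url) else "block")
```

Unknown endpoints are blocked by default, so a misconfigured URL fails closed rather than quietly sending a prompt out of region.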

Mental Model 3: The Cost and Its Compounding Dynamics

If you remember nothing else: Agents don't make inference cheaper. They make it recursive.

Token (inference context) — The unit of text an LLM processes and generates. Roughly 750 words equals 1,000 tokens. Pricing is per token: input tokens (what you send) and output tokens (what comes back) are typically priced separately, with output costing 3–5x more. When it comes up: Budget conversations, ROI modeling, any discussion of "how much will this cost in production." A single user query might consume 500 tokens. A 10-step agentic workflow processing the same underlying task might consume 15,000–50,000 tokens. Same price per token; radically different usage pattern. Don't confuse with: Authentication tokens. See the vocabulary table below.
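
For budget conversations, the per-call arithmetic is simple enough to sketch. The prices below are placeholders chosen only to show a 5x output premium. They are not any vendor's actual rates, and the word-to-token ratio is the rule of thumb from the text, not a tokenizer.

```python
def estimate_tokens(word_count: int) -> int:
    """Rule of thumb: about 750 words per 1,000 tokens."""
    return round(word_count * 1000 / 750)


def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one call in dollars, given prices per million tokens."""
    total = (input_tokens * input_price_per_m
             + output_tokens * output_price_per_m)
    return total / 1_000_000


if __name__ == "__main__":
    # Placeholder prices: $3 per million input, $15 per million output.
    prompt_tokens = estimate_tokens(300)  # a 300-word prompt
    cost = call_cost(prompt_tokens, 100, 3.0, 15.0)
    print(f"{prompt_tokens} input tokens, 100 output tokens: ${cost:.4f}")
```

With those placeholder prices, a 400-token prompt and a 100-token reply come to $0.0027. Note that the 100 output tokens cost more than the 400 input tokens.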

Context window — The maximum number of tokens a model can process in a single inference call, including both the input and the generated output. Current frontier models support 128K–1M token windows. Larger windows cost more per call and increase latency. When it comes up: When customers ask about processing long documents, maintaining conversation history, or feeding large policy corpora to a model. A 128K context window holds roughly 100,000 words — about the length of a full federal procurement document. Fitting it in one call is technically possible. Doing it repeatedly at production volume is a budget line item. Don't confuse with: Memory or storage. The context window is ephemeral. Nothing persists between calls unless the application explicitly re-injects prior context. Stateful agent design is hard precisely because the model has no memory of its own.
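
"Nothing persists between calls" means the application has to rebuild the input every time. A minimal sketch of that re-injection step, assuming the application stores prior turns itself and keeps only as many recent ones as fit an input budget. The function names and the word-based token estimate are illustrative; a real system would use the model's own tokenizer.

```python
def rough_tokens(text: str) -> int:
    """Rough estimate at ~750 words per 1,000 tokens; tokenizers differ."""
    return max(1, round(len(text.split()) * 1000 / 750))


def build_context(system_prompt: str, history: list[str],
                  user_query: str, input_budget: int) -> list[str]:
    """Assemble one call's input: prompt, recent history that fits, query."""
    budget = input_budget - rough_tokens(system_prompt) - rough_tokens(user_query)
    kept = []
    for turn in reversed(history):  # walk from newest to oldest
        cost = rough_tokens(turn)
        if cost > budget:
            break  # older turns are dropped; the model never sees them
        kept.append(turn)
        budget -= cost
    return [system_prompt, *reversed(kept), user_query]


if __name__ == "__main__":
    history = ["first turn here", "second turn here", "third turn here"]
    print(build_context("system", history, "question", input_budget=10))
```

Anything the loop drops is gone as far as the model is concerned. Whatever "memory" the assistant appears to have lives in `history`, which the application owns.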

Agentic loop cost — The cumulative inference cost generated when an AI agent executes a multi-step task by making repeated model calls, each of which may include the full prior context. When it comes up: Any conversation about AI agents automating workflows. A single agent task that requires 10 model calls, each with a growing context, doesn't cost 10x a single query — it can cost 50–100x, because each call re-sends the accumulated conversation history. The second-order cost nobody mentions in the keynote. Don't confuse with: Orchestration overhead. The compute cost of the orchestration layer (routing calls, managing state) is negligible. The token cost of the inference calls is not.
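
The multiplication is easy to reproduce. A minimal sketch, assuming every step adds the same number of tokens and every call re-sends everything accumulated so far. Real agents vary per step and may use caching or summarization, so treat this as a shape, not a forecast.

```python
def agent_loop_tokens(steps: int, tokens_per_step: int) -> int:
    """Total input tokens when call n re-sends all n steps of context."""
    return sum(tokens_per_step * n for n in range(1, steps + 1))


if __name__ == "__main__":
    single_query = 500
    total = agent_loop_tokens(steps=10, tokens_per_step=single_query)
    print(f"10-step loop: {total} tokens, {total // single_query}x one query")
```

Ten steps at 500 tokens each come to 27,500 input tokens, or 55x a single 500-token query, because the per-call context grows as 1, 2, 3, and so on up to 10 steps. The cost grows with the square of the step count, not linearly.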

Vocabulary Mapping: Linguistic and Data Terms

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Token	Smallest unit of text processed by a model; the billing unit for inference	Bearer token, OAuth access token	In AI, a token is a fragment of text (sometimes a word, sometimes a syllable). In IDAM, a token is a credential artifact. No relationship between the two. Using "token" without qualification in a mixed conversation will cause confusion every time.
Context	The full text input sent to a model in a single call, including system prompt, history, and user query	Security context (user identity + attributes + session state)	AI context is ephemeral and stateless by default — it exists only for the duration of one inference call. IDAM context persists across a session and is the basis for authorization decisions. An AI model has no concept of "the current user" unless the application injects that information into the context explicitly.

Vocabulary Mapping: Identity and Workflow Terms

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Agent	An LLM-powered process that takes actions autonomously across multiple steps, often calling tools or APIs	Software agent (endpoint agent, directory sync agent)	An IDAM agent is a passive relay — it executes defined sync or auth operations. An AI agent makes decisions about what to do next based on model output. The AI agent needs its own identity, its own scoped credentials, and ideally its own audit trail. None of that is automatic.
Session	Colloquially, a conversation thread with a model; has no formal persistence unless the application builds it	Auth session (time-bounded, tied to a principal, revocable)	An AI "session" has no native timeout, no revocation mechanism, and no binding to an authenticated principal unless the application layer enforces all three. When a customer says their AI assistant "remembers" previous conversations, ask where that memory is stored and who can access it. The answer is almost never "the model."
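
The three properties the Session row says are missing (timeout, revocation, binding to a principal) have to be built in the application layer. A minimal sketch of what that could look like. The class and field names are hypothetical, and a real system would tie this to its existing identity provider rather than a standalone object.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChatSession:
    principal: str       # authenticated user this session is bound to
    ttl_seconds: float   # timeout enforced by the application
    created_at: float = field(default_factory=time.monotonic)
    revoked: bool = False
    history: list = field(default_factory=list)  # stored by the app

    def is_valid(self, caller: str, now: Optional[float] = None) -> bool:
        """Check revocation, principal binding, and timeout together."""
        now = time.monotonic() if now is None else now
        return (not self.revoked
                and caller == self.principal
                and now - self.created_at < self.ttl_seconds)


if __name__ == "__main__":
    session = ChatSession(principal="alice", ttl_seconds=900)
    print(session.is_valid("alice"))  # owner, within timeout
    print(session.is_valid("bob"))    # different principal
    session.revoked = True
    print(session.is_valid("alice"))  # revoked
```

The conversation history sits in the same object as the access checks, which answers the "where is that memory stored and who can access it" question for this sketch: in the application, gated by the application.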

Source Index

For the reader who wants to go deeper on a specific concept before a call.

"What a Model Actually Is: Weights, Parameters, and the File on Disk" (Under the Hood, Article 1)
Use this if the conversation turns to on-prem deployment feasibility or hardware procurement. Covers parameter count, weight precision, quantization tradeoffs, and why VRAM is the binding constraint rather than total compute. The section on INT4 quantization is the one to reread before any "we want to run it ourselves" conversation.

"Where Your Model Lives: Deployment Topology and What It Means for Your Customer" (Under the Hood, Article 2)
Reach for this before any federal account call where data classification has come up. Covers hosted vs. self-hosted architecture, FedRAMP authorization boundaries, data residency requirements for inference workloads, and the distinction between the model endpoint and the application layer. The section on prompt data as regulated data is the load-bearing argument.

"The Real Cost of Inference: Tokens, Loops, and the Bill Nobody Budgeted For" (Under the Hood, Article 3)
Use this before any conversation about AI ROI, budget modeling, or agent workflow design. Covers token pricing mechanics, context window cost dynamics, and agentic loop cost multiplication. The worked example of a 10-step agent task consuming 40x the tokens of a single query is the number worth having ready.

Not covered above, but worth knowing: model versioning and deprecation cycles. Hosted model providers update and deprecate model versions on schedules that don't align with enterprise change management cycles. This is an emerging procurement and continuity risk that none of the source articles address directly.