Under the Hood
The File and the Envelope

A trained AI model is a file. A big one, sure, but still a file you can copy, move, and run on hardware you control. DeepSeek R1 hosted on servers in China and the same model running through Amazon Bedrock in Virginia are byte-for-byte identical weights with radically different security postures. The model is the constant. Everything around it is where the decisions live.
This section covers the six dimensions of that envelope: hardware, hosting, jurisdiction, licensing, cost, and customization. All six are showing up in the procurement conversations your buyers are already having, and none require you to understand how neural networks learn.
"Deploying a model" means copying a file onto a machine with enough GPU memory to hold it. Gemma 4 fits on a $1,600 graphics card. DeepSeek R1 needs a GPU cluster at $15/hour. The difference is file size and the memory ceiling of the hardware. This piece profiles four current models as physical specimens — their weight in gigabytes, the GPU memory they demand, and the specific hardware tiers that run them. Once you see deployment as a file-size problem, the on-prem inference conversation stops being mysterious.

Every model ships with a parameter count (7B, 70B, 405B), and that number directly sets the hardware floor for running it. One formula determines whether a model physically fits on a given machine: bytes per parameter times parameter count equals memory required. A 70B model at 16-bit precision (2 bytes per parameter) needs 140 GB of VRAM for the weights alone. That's multiple datacenter GPUs before it processes a single token. Quantization compresses those weights to 4-bit precision, cutting memory by 75%, to about 35 GB, with published benchmarks showing over 99% quality retention. When your buyer's team is debating on-premises deployment versus API access, the conversation runs through this formula. Worth knowing cold.
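
The formula is short enough to sketch in a few lines of Python. This is an illustration, not a sizing tool: it assumes 16-bit weights (2 bytes per parameter) as the baseline that yields the 140 GB figure, it counts weights only (runtime overhead such as the KV cache comes on top), and the function names and the 80 GB card capacity are hypothetical choices made for the example.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives gigabytes directly.
    """
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str, vram_gb: float) -> bool:
    """True if the weights alone fit in the given VRAM (ignores runtime overhead)."""
    return weight_memory_gb(params_billions, precision) <= vram_gb


if __name__ == "__main__":
    full = weight_memory_gb(70, "fp16")   # 140.0 GB
    quant = weight_memory_gb(70, "int4")  # 35.0 GB
    print(f"70B at 16-bit: {full} GB, at 4-bit: {quant} GB")
    print(f"Reduction: {1 - quant / full:.0%}")  # 75%
    # Assumed example capacity: a single 80 GB accelerator.
    print("Fits on one 80 GB card at 4-bit:", fits(70, "int4", 80))
    print("Fits on one 80 GB card at 16-bit:", fits(70, "fp16", 80))
```

The same two functions answer the question buyers actually ask: swap in the parameter count and the memory of the hardware on the table, and the result says whether the conversation is about one card or a cluster.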

Your buyer says "open source" and means three different things in the same sentence. Only five research models meet the OSI's actual definition — none appear in enterprise procurement. Every model your buyer is evaluating — Llama 4, DeepSeek R1, Qwen 3 — is open-weight at best, with license terms ranging from MIT-permissive to military-use-prohibited to revocable. This piece covers what each category actually means, what the license pages say, and the precise vocabulary that keeps you credible when the CAIO brings up "open."

DeepSeek R1 is the same file whether it runs on your hardware, in an AWS Bedrock region, inside GovCloud, or on a server in China. Byte-identical weights. Completely different prompt paths, legal envelopes, and telemetry exposure at each address. When a buyer says "we're evaluating DeepSeek," the next word out of your mouth should be "where?" This piece profiles all four deployment locations against the dimensions that actually surface in the room: who sees the prompts, where the bytes travel, what legal framework governs the data, and what operational burden lands on the customer. Same model, four addresses, four different conversations.

DeepSeek R1 on AWS Bedrock and DeepSeek R1 on DeepSeek's own API use identical weights. Same model, same file. The legal, security, and operational envelope around the inference is completely different: which government can compel disclosure of your prompts, what a breach actually exposes, whether you can prove what happened in an audit, and how badly latency compounds when an agent makes 40 tool calls per task. This piece compares five hosting arrangements across the dimensions that drive procurement decisions. Jurisdiction follows corporate ownership. Server location is a different question entirely. That distinction will reshape how your next conversation about "just using the EU region" goes.

LLM providers publish per-token pricing. Multiply by expected turns, and you have a budget. For agent workloads, that budget is wrong by an order of magnitude. Every agent turn resends the entire prior conversation — every tool result, every reasoning step — because the API is stateless. A 20-step loop doesn't cost 20 turns of input. It costs 210. Stanford researchers found the same agent on the same task can vary by 30× across runs. Context accumulation is the mechanism behind the cost surprises that kill agent deployments before they leave pilot.
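
The 210 figure is the sum 1 + 2 + ... + 20: step one sends one turn of input, step two resends the first turn plus its own, and so on. A minimal sketch of that arithmetic follows. It assumes every turn adds the same amount of context, which real agents do not, and the function names and the 1,000-tokens-per-turn figure are hypothetical, chosen only to make the gap visible.

```python
def billed_input_turns(steps: int) -> int:
    """Total turns of input billed when each step resends all prior turns.

    Step k sends k turns of context, so the total is 1 + 2 + ... + steps.
    """
    return sum(range(1, steps + 1))


def billed_input_tokens(steps: int, tokens_per_turn: int) -> int:
    """Input tokens billed, assuming each turn adds the same number of tokens."""
    return billed_input_turns(steps) * tokens_per_turn


if __name__ == "__main__":
    steps = 20
    tokens_per_turn = 1_000  # assumed for illustration only

    naive = steps * tokens_per_turn                       # 20,000
    actual = billed_input_tokens(steps, tokens_per_turn)  # 210,000

    print(f"Turns of input billed: {billed_input_turns(steps)}")  # 210
    print(f"Naive budget: {naive:,} tokens")
    print(f"Billed input: {actual:,} tokens ({actual / naive:.1f}x)")  # 10.5x
```

Under these assumptions the billed input is 10.5 times the naive estimate for a 20-step loop, and the multiplier grows with every additional step because the total scales with the square of the step count.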

When a public sector buyer says "we're customizing the model," they could mean three things separated by orders of magnitude: training from scratch (hundreds of millions of dollars), fine-tuning (thousands to millions), or prompting (cents per request). Most buyers don't distinguish between them. OpenAI shut down self-serve fine-tuning to new users in May 2026. Context windows now handle over a million tokens, roughly 1,500 pages per prompt. Fine-tuning's territory has shrunk considerably, and the optimization hierarchy that replaced it gives you a framework for the conversation. One question earns immediate credibility: "Are you adapting the model's weights, or shaping behavior at inference time?"

Recap — A Model Is a File and Six Things That Change Around It
A model is a file. Everything else — hardware, compression, licensing, hosting, cost, customization — is a variable that changes around it. You just read the long version. This is the version you pull up in the parking lot before the call. Every entry follows the same shape: what the term means, when it surfaces in a conversation, and what it gets confused with. The vocabulary collision tables at the end map AI terms to the IDAM concepts already in your head, with the specific point where the analogy breaks.
