Recap — A Model Is a File and Six Things That Change Around It

Structured reference scaffold consolidating model files, GPU requirements, licensing traps, hosting tiers, token economics, and customization approaches for deal prep.

By Leigh Garrity — May 9, 2026

A model is a file. Everything a buyer cares about is a variable that changes around that file: what hardware runs it, how it gets compressed, what the license permits, where it's hosted, what each use costs, and how it gets customized. That's the scaffold. Everything below hangs on it.

The Invariant: The File

Model weights (parameters) — The numerical values a model learned during training, stored in a file. A 70-billion-parameter model is 70 billion numbers on a disk. The file doesn't run. It sits there until something loads it into memory and does math against it.

When it comes up: Every time a buyer asks "where does the AI live?" It's a file. The file lives wherever someone puts it.
Don't confuse with: Inference (the running model). The file is inert. Inference is what happens when you load it and send it a prompt.

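Active vs. total parameters — Mixture-of-experts models (DeepSeek V4-Pro: 1.6T total, 49B active) fire only a fraction of their parameters per token. The total number sets how much memory the weights occupy, because every expert has to be loaded and ready. The active number sets how much compute each token costs, which drives speed and serving cost.

When it comes up: Buyer sees "1.6 trillion parameters" and assumes the model runs as slowly and expensively as a dense model of that size. It computes like a 49-billion-parameter model, but all 1.6 trillion parameters still have to fit in memory.
Don't confuse with: Total parameter count. These are not the same number and they answer different hardware questions: total is memory, active is compute.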

If you remember nothing else: A model is a file you can copy, move, host, and lock down like any other file. Everything interesting happens in the math that runs against it.

Hardware & Quantization

GPU VRAM — The memory on a graphics card that holds model weights during inference. The model must fit in VRAM, or be split across multiple cards, to run.

When it comes up: Any self-hosted or on-prem deployment conversation. "How many GPUs?" is a VRAM question.
Don't confuse with: System RAM. VRAM is on the GPU. Regular server memory doesn't help here.

Quantization — Compressing weights from high-precision formats (16 bits per number) to lower ones (8 or 4 bits). Cuts VRAM needs in half or more. Costs some accuracy.

When it comes up: When the buyer wants to run a model on fewer or cheaper GPUs. The tradeoff that makes self-hosting economically viable.
Don't confuse with: Pruning or distillation (different compression techniques). Quantization changes precision of existing weights, not the architecture.
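The arithmetic behind the VRAM and quantization entries can be sketched in a few lines. This is a rough estimate under stated assumptions: it counts the weights only (no KV cache, activations, or runtime overhead), and the 80 GB card size is an illustrative figure, not a recommendation. The function names are hypothetical.

```python
import math

def weight_footprint_gb(params: float, bits_per_weight: int) -> float:
    """Size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

def gpus_needed(footprint_gb: float, vram_per_gpu_gb: float = 80.0) -> int:
    """Minimum card count to hold the weights; ignores KV cache and activations."""
    return math.ceil(footprint_gb / vram_per_gpu_gb)

PARAMS_70B = 70e9  # the 70-billion-parameter example used earlier in this recap

for bits in (16, 8, 4):
    size = weight_footprint_gb(PARAMS_70B, bits)
    print(f"{bits:>2}-bit: {size:6.1f} GB of weights, "
          f"at least {gpus_needed(size)} x 80 GB GPU(s)")
```

At 16 bits the 70B example needs 140 GB for weights alone, so at least two 80 GB cards. At 8 bits it is 70 GB and at 4 bits 35 GB, either of which fits on one card before overhead. Real deployments need headroom on top of these figures.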

If you remember nothing else: Quantization is the knob between "runs on hardware we can afford" and "gives answers we can trust." Every self-hosted deployment turns it.

Licensing

Open weights vs. open source — Open weights means the model file is downloadable. Open source, per OSAID v1.0, requires weights plus training data information, code, and documentation sufficient to recreate the system.

Almost no major model qualifies. As of May 2026, five models pass OSAID: Pythia, OLMo, Amber, CrystalCoder, T5. None are frontier-class.
Llama 4: restrictive community license with usage caps and acceptable-use restrictions. DeepSeek V4: MIT license (permissive, no usage caps). Neither is OSAID-compliant because neither releases training data.
The OSAID definition itself is under revision, with an update expected Q4 2026.
When it comes up: Every time a buyer says "we'll use an open-source model." They mean open-weight. The procurement implications of MIT vs. Llama's license are completely different.
Don't confuse with: Each other. "Open" is doing a lot of unsupervised work in AI right now.

If you remember nothing else: "Open weights" means you can hold the file. It doesn't mean you can do what you want with it. Read the license like a contract, because it is one.

Hosting & Jurisdiction

Bedrock three-tier inference — AWS routes model requests through three tiers: In-Region (data stays in one AWS region), Geo Cross-Region (stays within a named geography: US, EU, Japan, or Australia), Global (routes anywhere for throughput). On Bedrock, inference data stays in the selected tier and is never shared with the model provider.

When it comes up: Every public sector AI conversation. The buyer's data residency requirement maps to a tier. In-Region is the answer for strict regulatory environments.
Don't confuse with: Training origin. Where a model learned and where your data goes during inference are separate facts. DeepSeek was trained in China. On Bedrock In-Region, your prompt never leaves the region you selected, for example Northern Virginia (us-east-1).

Latency vs. data residency — Not covered in depth in the source articles, but worth knowing: In-Region keeps data local but limits throughput to one region's capacity. Geo Cross-Region improves throughput by routing across regions within a geography. The tradeoff is real: stricter residency means fewer routing options, which can mean slower responses under load. Ask the buyer which constraint they're solving for.
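The mapping from a buyer's residency requirement to a tier can be written down as a small lookup. This is a discussion aid only: the `Residency` enum and `pick_tier` function are hypothetical names, nothing here calls an AWS API, and the tier names are simply the three described above.

```python
from enum import Enum

class Residency(Enum):
    SINGLE_REGION = "data must stay in one region"
    WITHIN_GEOGRAPHY = "data must stay in a named geography"
    NO_CONSTRAINT = "no residency requirement"

# Strictest requirement maps to the tier with the fewest routing options.
TIER_FOR = {
    Residency.SINGLE_REGION: "In-Region",
    Residency.WITHIN_GEOGRAPHY: "Geo Cross-Region",
    Residency.NO_CONSTRAINT: "Global",
}

def pick_tier(requirement: Residency) -> str:
    """Return the tier name that matches the stated residency requirement."""
    return TIER_FOR[requirement]

print(pick_tier(Residency.SINGLE_REGION))
```

The point of the sketch is the direction of the tradeoff: the stricter the requirement, the fewer places a request can be routed under load.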

If you remember nothing else: Bedrock's tier system turns "where does our data go?" from a political question into an infrastructure configuration.

Inference Cost

Tokens (LLM) — The units a model reads and generates. Each is roughly ¾ of an English word. Billed separately for input and output. Output costs more, from a little under 2× to roughly 10× the input rate depending on the provider and model.

When it comes up: Every pricing and sizing conversation.
Don't confuse with: Authentication tokens. See collision table below.

Agentic token cost — A Stanford DEL study (April 2026, arXiv preprint) found agentic coding tasks consume 1,000× more tokens than standard code reasoning. The mechanism: agents re-read the entire conversation history before every action, and that history compounds. Same-task runs varied by up to 30×. More tokens did not mean better results.

When it comes up: When a buyer plans to deploy AI agents. Per-token price looks cheap. Per-task cost is not.
Don't confuse with: Output cost. Input tokens drive agentic cost because the agent re-reads everything, every turn.
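The compounding mechanism is easy to see in a toy model. The sketch below assumes a naive agent loop that re-sends the full history on every step. The step count and token sizes are made-up illustrative values, not figures from the Stanford study, and `agent_input_tokens` is a hypothetical helper.

```python
def agent_input_tokens(steps: int, base_context: int, tokens_added_per_step: int) -> int:
    """Total input tokens billed when every step re-sends the whole history."""
    total = 0
    history = base_context
    for _ in range(steps):
        total += history                  # the entire history is read again
        history += tokens_added_per_step  # this step's output and tool results are appended
    return total

single_call = agent_input_tokens(steps=1, base_context=2_000, tokens_added_per_step=1_500)
fifty_steps = agent_input_tokens(steps=50, base_context=2_000, tokens_added_per_step=1_500)

print(f"one call:  {single_call:,} input tokens")
print(f"50 steps:  {fifty_steps:,} input tokens")
print(f"multiplier: {fifty_steps / single_call:.0f}x")
```

With these assumed numbers, one call bills 2,000 input tokens and a 50-step run bills 1,937,500, roughly 969 times as many. The growth is quadratic in the number of steps, which is why trimming what goes back into the history matters more than the per-token rate.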

Reference pricing (May 2026, verify before use):

Model	Input $/MTok	Output $/MTok	Notes
DeepSeek V4 Flash	$0.30	$0.50	MIT license, 1M context
Gemini 2.5 Flash	$0.30	$2.50	Google's budget flagship
GPT-4.1	$2.00	$8.00	OpenAI production workhorse, 1M context
Claude Opus 4.7	$5.00	$25.00	New tokenizer may inflate effective cost up to 35%

All major providers offer batch (50% off) and prompt caching (~90% off repeated input).
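A per-task estimate from the rate card might look like the sketch below. The rates are copied from the table above and carry the same "verify before use" caveat. The discount handling is an assumption: it applies the roughly 90% caching discount to the cached share of input and the 50% batch discount to the whole bill, and providers differ on whether and how these combine. `task_cost` is a hypothetical helper.

```python
RATES = {  # model: (input $/MTok, output $/MTok), from the reference table
    "DeepSeek V4 Flash": (0.30, 0.50),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "GPT-4.1": (2.00, 8.00),
    "Claude Opus 4.7": (5.00, 25.00),
}

def task_cost(model: str, input_tokens: int, output_tokens: int,
              cached_fraction: float = 0.0, cache_discount: float = 0.90,
              batch: bool = False) -> float:
    """Estimated dollars for one task; discount stacking is an assumption."""
    in_rate, out_rate = RATES[model]
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = (fresh + cached * (1 - cache_discount)) * in_rate / 1e6
    output_cost = output_tokens * out_rate / 1e6
    total = input_cost + output_cost
    return total * 0.5 if batch else total

# Example: a task that reads 1M tokens and writes 100k tokens.
for name in RATES:
    print(f"{name:<18} ${task_cost(name, 1_000_000, 100_000):.2f}")
```

Feeding the function a task-level token count, rather than a single call's, is what turns a rate card into a budget figure.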

If you remember nothing else: The sticker price is per token. The bill is per task. Agents turn cheap tokens into expensive workflows because they re-read everything, every step.

Customization

RAG (Retrieval-Augmented Generation) — Feeding the model your documents at query time instead of retraining it. The weights don't change. You change what goes into the prompt.

When it comes up: "How do we make it know our stuff?" RAG is the answer that doesn't require GPUs, training pipelines, or an ML team.
Don't confuse with: Fine-tuning (changes the weights permanently) or prompt engineering (changes the instruction, not the source material).
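A minimal sketch of the RAG idea: pick the most relevant documents, paste them into the prompt, and leave the model untouched. The word-overlap retriever here is a deliberately crude stand-in (production systems typically use embedding search), and the function names, sample documents, and prompt wording are all hypothetical.

```python
def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by how many question words they share."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question: str, documents: list[str]) -> str:
    """Assemble the prompt: retrieved context first, then the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

docs = [
    "Password resets require manager approval.",
    "The cafeteria opens at 8 am.",
    "Contractor accounts expire after 90 days.",
]
print(build_prompt("When do contractor accounts expire?", docs))
```

Nothing in this flow touches the weights. The only thing that changes between a generic answer and a grounded one is the text sent in the prompt.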

Fine-tuning — Retraining a model's weights on your data so the behavior changes permanently. Requires compute, labeled data, and someone who knows what they're doing.

When it comes up: When RAG has hit its ceiling. The buyer needs the model's behavior to change permanently, beyond what feeding it new documents can fix.
Don't confuse with: RAG. Fine-tuning changes the file. RAG changes the input.

Prompt engineering — Writing better instructions. No infrastructure, no retraining. The cheapest customization by far, and where most teams should start.

When it comes up: Before everything else. If the buyer hasn't tried writing a better prompt, they don't need fine-tuning yet.
Don't confuse with: RAG. Prompt engineering shapes the question. RAG shapes the source material the model sees.

If you remember nothing else: Start with a better prompt. If the gap is knowledge, add RAG, which doesn't require an ML team. Fine-tune only when the model itself needs to act differently.

Vocabulary Collision Tables

Table 1: Token, Session, Context

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Token	~¾ of a word; billing and processing unit	OAuth/OIDC token (access, refresh, ID)	AI tokens are consumed and gone. IDAM tokens are issued, validated, and revoked. AI tokens carry no identity claim.
Session	Conversation with accumulated context; stateless between API calls unless context is re-sent	Authenticated session tied to a user identity	AI sessions carry no authentication state. "Session" means "the prompt history I'm sending again."
Context window	Max tokens a model processes in one call (e.g., 1M tokens)	Security context (user attributes, entitlements, environment signals)	AI context is a size limit. Security context is a trust evaluation. They share a word and nothing else.

Table 2: Scope, Agent, Policy

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Scope	Range of training data or task domain a model addresses	OAuth scope (permissions granted to a client)	AI scope is informal and unenforceable. OAuth scope is a machine-readable permission boundary. A model has no scope boundary to exceed. The failure mode is hallucination.
Agent	AI system taking multi-step actions autonomously, consuming tools and APIs	Service account, machine identity	AI agents act autonomously but have no native identity, credential lifecycle, or access governance. The gap your product fills.
Policy	Acceptable-use restrictions in a model license; content filtering rules	Authorization policy, conditional access	AI policy is a legal/content constraint expressed in prose. IDAM policy is a technical enforcement mechanism expressed in code.

Source Index

Recap Entry	Source
Model weights, active vs. total parameters	DeepSeek V4-Pro model card (HuggingFace)
Quantization	Hardware vendor documentation; quantization quality varies by model and task — avoid generalizing from single benchmarks
Open weights vs. open source	OSAID v1.0 definition; OSI validated model list; DeepSeek V4 MIT license (HuggingFace)
Bedrock three-tier inference	AWS regional availability docs; AWS DeepSeek R1 launch blog
Tokens, agentic token cost	Stanford DEL, arXiv:2604.22750 (preprint, April 2026)
Reference pricing	Anthropic official; MetaCTO (OpenAI); pricepertoken.com (Gemini); Chat-Deep (DeepSeek); Opus 4.7 tokenizer analysis
RAG, fine-tuning, prompt engineering	Customization approaches covered in source articles

Verify before citing

DeepSeek V4 may now be available on Bedrock per third-party reporting (May 2026), but AWS official documentation has not confirmed this as of May 9, 2026.

Things to follow up on...

Claude's hidden price increase: Anthropic's Opus 4.7 kept the same $5/$25 rate card but shipped a new tokenizer that can produce up to 35% more tokens for the same input text, meaning effective per-task cost rises even when the sticker price doesn't.
Agent cost compounding math: The Stanford Digital Economy Lab paper found that naive agent loops rebill prior context on every call, and one analysis showed 39–60% of tool-result tokens were removable with no performance loss, which makes context engineering the primary cost lever for agentic deployments.
OSAID revision timeline: The Open Source Initiative is running a process to address major gaps in the current definition, with an updated OSAID expected by Q4 2026, which could change how buyers and procurement offices classify every "open" model on the market.
DeepSeek V4 promo expiration: DeepSeek V4-Pro is running at a 75% promotional discount through May 31, 2026, after which the list price jumps from $0.435/$0.87 to $1.74/$3.48 per MTok — a 4× increase that could reshape any cost comparison built on current rates.