The Context Window | Under the Hood — Issue 1

Under the Hood

VERSION 1 | Sunday, May 10

A model is the same file everywhere. What changes — hardware, jurisdiction, license, cost — is what your buyer is actually asking about.

Section Opener

The File and the Envelope

By Leigh Garrity | May 9, 2026

A trained AI model is a file. A big one, sure, but still a file you can copy, move, and run on hardware you control. DeepSeek R1 hosted on servers in China and the same model running through Amazon Bedrock in Virginia are byte-for-byte identical weights with radically different security postures. The model is the constant. Everything around it is where the decisions live.

This section covers the six dimensions of that envelope: hardware, hosting, jurisdiction, licensing, cost, and customization. All six are showing up in the procurement conversations your buyers are already having, and none require you to understand how neural networks learn.

Leigh Garrity

Former pre-sales engineer turned technical writer, with a decade in federal identity sales and a policy degree that taught her to read procurement logic before innovation logic. Writes like a sharp colleague: precise, direct, anchored in specifics, allergic to hype. Bridges what the reader already knows toward what they don't—then marks exactly where the bridge stops holding.

PART ONE

A Model Is a File

"Deploying a model" means copying a file onto a machine with enough GPU memory to hold it. Gemma 4 fits on a $1,600 graphics card. DeepSeek R1 needs a GPU cluster at $15/hour. The difference is file size and the memory ceiling of the hardware. This piece profiles four current models as physical specimens — their weight in gigabytes, the GPU memory they demand, and the specific hardware tiers that run them. Once you see deployment as a file-size problem, the on-prem inference conversation stops being mysterious.

PART TWO

The Formula That Sets the Hardware Floor

Every model ships with a parameter count (7B, 70B, 405B), and that number directly sets the hardware floor for running it. One formula determines whether a model physically fits on a given machine: bytes per parameter times parameter count equals memory required. A 70B model at 16-bit precision (2 bytes per parameter) needs 140 GB of VRAM for the weights alone. That's multiple datacenter GPUs before it processes a single token. Quantization compresses those weights to 4-bit precision, cutting memory by 75% to about 35 GB, with published benchmarks showing over 99% quality retention. When your buyer's team is debating on-premises deployment versus API access, the conversation runs through this formula. Worth knowing cold.
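The formula is small enough to run in your head, but here is a minimal sketch for checking the numbers. It assumes the 140 GB figure corresponds to 2 bytes per parameter (16-bit weights) and counts weights only; the function name `weight_memory_gb` is illustrative, and real deployments need extra headroom for activations and the attention cache.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes).

    bytes per parameter x parameter count = memory required.
    Billions of parameters times bytes per parameter gives GB directly.
    """
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param


if __name__ == "__main__":
    # Same 70B model at three precisions: only the bytes per parameter change.
    for bits in (16, 8, 4):
        print(f"70B at {bits}-bit: {weight_memory_gb(70, bits):.0f} GB")
```

Going from 16-bit to 4-bit is the 75% cut described above: the parameter count stays fixed and each parameter takes a quarter of the space.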

PART THREE

Open Weights vs. Open Source vs. Closed — What the License Page Actually Says

Your buyer says "open source" and means three different things in the same sentence. Only five research models meet the OSI's actual definition — none appear in enterprise procurement. Every model your buyer is evaluating — Llama 4, DeepSeek R1, Qwen 3 — is open-weight at best, with license terms ranging from MIT-permissive to military-use-prohibited to revocable. This piece covers what each category actually means, what the license pages say, and the precise vocabulary that keeps you credible when the CAIO brings up "open."

PART FOUR

Same Weights, Four Addresses

DeepSeek R1 is the same file whether it runs on your hardware, in an AWS Bedrock region, inside GovCloud, or on a server in China. Byte-identical weights. Completely different prompt paths, legal envelopes, and telemetry exposure at each address. When a buyer says "we're evaluating DeepSeek," the next word out of your mouth should be where. All four deployment locations, profiled against the dimensions that actually surface in the room: who sees the prompts, where the bytes travel, what legal framework governs the data, and what operational burden lands on the customer. Same model, four addresses, four different conversations.

PART FIVE

Why Hosting Location Is the Whole Ballgame

DeepSeek R1 on AWS Bedrock and DeepSeek R1 on DeepSeek's own API use identical weights. Same model, same file. The legal, security, and operational envelope around the inference is completely different: which government can compel disclosure of your prompts, what a breach actually exposes, whether you can prove what happened in an audit, and how badly latency compounds when an agent makes 40 tool calls per task. This piece compares five hosting arrangements across the dimensions that drive procurement decisions. Jurisdiction follows corporate ownership. Server location is a different question entirely. That distinction will reshape how your next conversation about "just using the EU region" goes.

PART SIX

Every Agent Turn Costs More Than the Last

LLM providers publish per-token pricing. Multiply by expected turns, and you have a budget. For agent workloads, that budget is wrong by an order of magnitude. Every agent turn resends the entire prior conversation — every tool result, every reasoning step — because the API is stateless. A 20-step loop doesn't cost 20 turns of input. It costs 210. Stanford researchers found the same agent on the same task can vary by 30× across runs. Context accumulation is the mechanism behind the cost surprises that kill agent deployments before they leave pilot.
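Where 210 comes from: if step 1 sends one turn of input, step 2 resends that turn plus its own, and so on, the total is 1 + 2 + ... + 20. A minimal sketch, assuming every turn adds roughly the same number of tokens (real loops are lumpier); the function name is illustrative.

```python
def cumulative_input_turns(steps: int) -> int:
    """Turn-equivalents of input billed when step i resends all i turns so far."""
    return sum(range(1, steps + 1))


if __name__ == "__main__":
    steps = 20
    naive = steps                            # the "multiply by turns" budget
    actual = cumulative_input_turns(steps)   # what a stateless API bills
    print(f"naive: {naive} turns, actual: {actual} turns, ratio: {actual / naive:.1f}x")
```

The total grows with the square of the step count, which is why doubling the length of a loop roughly quadruples the input bill.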

PART SEVEN

Training, Fine-Tuning, and Prompting — What Each One Actually Changes in the Model

When a public sector buyer says "we're customizing the model," they could mean three things separated by orders of magnitude: training from scratch (hundreds of millions of dollars), fine-tuning (thousands to millions), or prompting (cents per request). Most buyers don't distinguish between them. OpenAI shut down self-serve fine-tuning to new users in May 2026. Context windows now handle over a million tokens, roughly 1,500 pages per prompt. Fine-tuning's territory has shrunk considerably, and the optimization hierarchy that replaced it gives you a framework for the conversation. One question earns immediate credibility: "Are you adapting the model's weights, or shaping behavior at inference time?"

Recap — A Model Is a File and Six Things That Change Around It

A model is a file. Everything else — hardware, compression, licensing, hosting, cost, customization — is a variable that changes around it. You just read the long version. This is the version you pull up in the parking lot before the call. Every entry follows the same shape: what the term means, when it surfaces in a conversation, and what it gets confused with. The vocabulary collision tables at the end map AI terms to the IDAM concepts already in your head, with the specific point where the analogy breaks.

Quick-Scan Reference

Your buyer just mentioned running a 671B-parameter model on-prem and you need to know, in about thirty seconds, whether that requires a server rack or a small building. Or they dropped "DeepSeek" and now you're wondering if this is a data sovereignty conversation, a licensing conversation, or both. (It's both.)

These five cards are built to live on your phone. Each covers one dimension of what models physically are, what they cost to run, and what the legal fine print actually says when someone bothers to read it. VRAM requirements, license terms, data paths, agent cost math, and the spectrum from prompt engineering to pre-training.

Glance before the call. Know what you're walking into.

VRAM Reference

Parameters Set the Hardware Bill, Period

At 4-bit quantization: 7-8B models fit in ~5-6GB VRAM. A 70B model wants ~35GB. Moving to 8-bit doubles those numbers, and 16-bit roughly quadruples them. Past 405B you're talking datacenter multi-GPU. The MoE wrinkle matters here: DeepSeek R1 carries 671B total parameters but fires only 37B per token. Total parameters set the memory floor. Active parameters set inference cost. Two different numbers, two different budget lines.
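To keep the two MoE numbers on separate budget lines, here is a rough sketch using the figures from this card. It counts weights only, and the function names are illustrative rather than anything from a vendor tool.

```python
TOTAL_PARAMS_B = 671   # every expert must sit in memory
ACTIVE_PARAMS_B = 37   # parameters actually used per token


def memory_floor_gb(total_params_billions: float, bits_per_param: int) -> float:
    """Weights-only memory floor in GB, driven by TOTAL parameters."""
    return total_params_billions * bits_per_param / 8


def active_share(active_params_billions: float, total_params_billions: float) -> float:
    """Fraction of the model that does work on any single token."""
    return active_params_billions / total_params_billions


if __name__ == "__main__":
    print(f"Memory floor at 4-bit: {memory_floor_gb(TOTAL_PARAMS_B, 4):.1f} GB")
    print(f"Share of weights active per token: {active_share(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
```

The hardware has to hold all 671B parameters even though only a small fraction of them runs on each token.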

License Check

"Open Weights" and "Open Source" Are Different Things

Current landscape: DeepSeek R1/V3.2/V4 ship under MIT. Mistral Large 3, Gemma 4, and OpenAI's GPT-OSS all carry Apache 2.0. Llama 4 uses Meta's Community License, which caps you at 700M monthly active users and requires attribution. Qwen 3.5 is Apache 2.0 for smaller variants, Tongyi Qianwen license for the flagship. Here's the thing none of them advertise loudly: not one releases training data. So by the OSI's OSAID v1.0 definition, none of them qualify as open source AI. If your buyer says "open source," ask which kind they mean.

Data Path

Same Weights, Completely Different Legal Jurisdiction

Four hosting locations, four different answers to "who sees my data." Local: you do, nobody else. Hyperscaler: the cloud provider's infrastructure, their contractual commitments. Sovereign cloud: jurisdiction-specific guarantees. Provider API: the model maker's servers, their terms. The DeepSeek case makes this concrete. Run DeepSeek on AWS Bedrock and your data stays in US data centers. No bits flow to DeepSeek's infrastructure. Call DeepSeek's own API and your prompts route through Chinese servers. Identical model file. Entirely different legal envelope. AWS Bedrock itself offers three inference tiers worth knowing: In-Region keeps data in a single AWS region, Geo Cross-Region allows movement within a geography, and Global Cross-Region routes worldwide. Public sector buyers with residency requirements need In-Region. Full stop.

Agent Costs

Every Agent Step Rebuilds the Entire Bar Tab

A single query costs what it costs. An agent loop is different. Each subsequent step re-reads the full accumulated context and gets billed for all of it again. Stanford's Digital Economy Lab measured this: agentic tasks consumed 1,000× more tokens than single-turn code reasoning. The culprit is input tokens, not output, because every action re-ingests everything that came before. Output tokens do cost 3-10× more per unit, but volume wins. Costs varied 30× across identical task runs, which should give everyone pause. The main relief valve right now is prompt caching. Anthropic, OpenAI, and Google all offer roughly 90% discounts on cache hits for repeated context. If your buyer is planning agent deployments and hasn't mentioned caching, that's a useful question to surface.
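A back-of-envelope sketch of what caching does to the input bill. Everything numeric here is an assumption for illustration: the $3 per million input tokens is a hypothetical price, 2,000 tokens per step is a made-up workload, every resent prefix is treated as a cache hit at a flat 90% discount, and cache-write surcharges and cache expiry are ignored.

```python
def agent_input_cost(steps: int, tokens_per_step: int,
                     price_per_million: float, cache_discount: float = 0.0) -> float:
    """Input cost in dollars of a loop where each step resends all prior context.

    Tokens added by the current step are billed at full price; tokens already
    sent in earlier steps are billed at (1 - cache_discount) of full price.
    """
    total = 0.0
    for step in range(1, steps + 1):
        fresh_tokens = tokens_per_step
        resent_tokens = (step - 1) * tokens_per_step
        billable = fresh_tokens + resent_tokens * (1.0 - cache_discount)
        total += billable * price_per_million / 1_000_000
    return total


if __name__ == "__main__":
    # Hypothetical workload: 20 steps, 2,000 new tokens per step, $3 per 1M input tokens.
    uncached = agent_input_cost(20, 2000, 3.0)
    cached = agent_input_cost(20, 2000, 3.0, cache_discount=0.9)
    print(f"uncached: ${uncached:.3f}  cached: ${cached:.3f}")
```

Under these assumptions the discount applies only to the resent portion, so the bill drops sharply but not by the full 90%: the fresh tokens each step still pay full price.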

Customization Spectrum

Three Levers, Wildly Different Price Tags

Training from scratch: $78M to north of $1B, months to years of work, and only frontier labs attempt it. This is building the engine from raw metal. Fine-tuning with LoRA: $5K-$50K, a few weeks to a couple months, and it embeds new behavior directly into the model's weights. Persistent changes. The model carries what it learned. Prompting and context engineering: cents per call, deployable in hours, gone when the session ends. Call-scoped only. OpenAI's own optimization docs lay out a hierarchy that's worth repeating to buyers: build evals first so you can measure what's working, then iterate on prompts until you've exhausted what they can do, and only then reach for fine-tuning. Most teams skip straight to the expensive step. Most teams regret it.