Opener
When a public sector buyer says "we're customizing the model," they might mean training one from scratch (hundreds of millions of dollars, only frontier labs do it), fine-tuning an existing model on their data (thousands to low millions), or prompting at inference time (cents per request). You'll encounter all three in buyer conversations, often without the buyer distinguishing between them. The sentence that buys you credibility: "Are you adapting the model's weights, or shaping its behavior at inference time?" It separates the approaches at the mechanism level and signals you understand the plumbing. It also gives the buyer a reason to tell you what they're actually doing.
Training from Scratch
What it is: Building a foundation model by initializing and optimizing billions of neural network parameters against a massive training corpus.
What it does: Produces a general-purpose model that encodes language understanding, reasoning, and world knowledge from raw data. GPT-4, Claude, and Gemini all started here. In practice, you're assembling a dataset of hundreds of billions of tokens, running forward and backward passes across thousands of GPUs for weeks or months, and iterating on architecture decisions throughout. BloombergGPT, a 50-billion-parameter financial model, required 512 A100 GPUs for 53 days, drawing on a corpus of roughly 709 billion tokens (363 billion of them proprietary financial data), at an estimated cost of $3–10 million. That's modest by frontier standards. Current frontier training runs are reported to cost hundreds of millions of dollars, though transparent cost methodologies for those figures are scarce.
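If a buyer pushes back on the BloombergGPT numbers, the arithmetic is easy to reproduce. The sketch below uses only the figures quoted above (512 GPUs, 53 days, $3–10 million) and assumes every GPU ran around the clock for the full period, which is a simplification. The function names are illustrative, and the implied rate is "all-in": the quoted estimate may well cover more than raw compute.

```python
# Figures quoted above for BloombergGPT.
GPUS = 512
DAYS = 53
COST_ESTIMATE_USD = (3_000_000, 10_000_000)  # low and high end of the quoted estimate


def gpu_hours(gpus: int, days: int) -> int:
    """Total GPU-hours, assuming every GPU runs 24 hours a day for the whole period."""
    return gpus * days * 24


def implied_rate_per_gpu_hour(total_cost_usd: float, hours: int) -> float:
    """Dollars per GPU-hour implied by a total cost estimate (not a quoted cloud price)."""
    return total_cost_usd / hours


hours = gpu_hours(GPUS, DAYS)
low, high = (implied_rate_per_gpu_hour(cost, hours) for cost in COST_ESTIMATE_USD)

print(f"{hours:,} GPU-hours")
print(f"Implied all-in rate: ${low:.2f} to ${high:.2f} per GPU-hour")
```

That works out to about 650,000 GPU-hours for a model that is small by frontier standards, which is the scale point worth landing in the room.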
Who's behind it: OpenAI, Anthropic, Google DeepMind, Meta, and a small number of well-funded startups. Almost no enterprise buyer is doing this; Bloomberg is the rare exception. If someone in a meeting says "training a model," they almost certainly mean fine-tuning.
What makes it distinct: You own the model's knowledge from the ground up. You also own the entire risk surface, the compute bill, and the problem of keeping it current. BloombergGPT is the cautionary tale worth knowing: GPT-4 outperformed it on most financial benchmarks, including FinQA (68.79% vs. 43%) and named entity recognition (83% vs. 61%), because general-purpose scale moved faster than domain specialization. The model you train today competes with the frontier model that ships next quarter.
Fine-Tuning
What it is: Taking a pre-trained model and further training it on a smaller, task-specific dataset to modify its weights.
What it does: Changes the model's behavior persistently. After fine-tuning, the model produces different outputs without special instructions in the prompt. It still earns its keep in three cases: enforcing rigid output formats across thousands of API calls, reducing token usage by baking behavior into weights instead of spelling it out on every call, and handling specialized domain vocabulary at high volume. Anthropic published data showing that fine-tuning Claude 3 Haiku for content moderation improved accuracy from 81.5% to 99.6% while reducing tokens per query by 85%. The payoff is consistency and lower cost per call on a narrow task at scale.
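To make the cost-per-call argument concrete, here is a minimal sketch of what an 85% token reduction does to a daily bill. Only the 85% figure comes from the result cited above. The request volume, prompt size, and per-token price are hypothetical placeholders, and the sketch assumes the fine-tuned model is billed at the same per-token rate as the base model, which may not hold with a real provider.

```python
TOKEN_REDUCTION = 0.85  # from the content moderation result cited above

# Hypothetical placeholders: substitute the buyer's real volume and pricing.
REQUESTS_PER_DAY = 100_000
PROMPTED_TOKENS_PER_REQUEST = 2_000
USD_PER_MILLION_TOKENS = 1.00


def daily_token_cost(requests_per_day: int, tokens_per_request: int,
                     usd_per_million_tokens: float) -> float:
    """Daily spend on tokens at a flat per-token price."""
    return requests_per_day * tokens_per_request * usd_per_million_tokens / 1_000_000


# Tokens per request once the instructions are baked into the weights.
tuned_tokens_per_request = round(PROMPTED_TOKENS_PER_REQUEST * (1 - TOKEN_REDUCTION))

prompted_cost = daily_token_cost(REQUESTS_PER_DAY, PROMPTED_TOKENS_PER_REQUEST,
                                 USD_PER_MILLION_TOKENS)
tuned_cost = daily_token_cost(REQUESTS_PER_DAY, tuned_tokens_per_request,
                              USD_PER_MILLION_TOKENS)

print(f"Prompted:   ${prompted_cost:,.2f} per day")
print(f"Fine-tuned: ${tuned_cost:,.2f} per day")
print(f"Savings:    ${prompted_cost - tuned_cost:,.2f} per day")
```

The savings scale linearly with volume, so the same reduction that matters at 100,000 requests a day is noise at 100. That is why volume is the first question to ask.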
Who's behind it: OpenAI offered self-serve fine-tuning through its API until May 2026, when it shut that platform to new users. OpenAI's own fine-tuning documentation now states the platform is winding down; existing customers can run jobs through January 2027. (The developer community forum contains the most detailed timeline; no separate blog announcement existed at time of writing.) Anthropic offers fine-tuning exclusively through Amazon Bedrock. Google offers it for Gemini models. The trend line: fine-tuning is moving from self-serve to managed enterprise relationships.
What makes it distinct: Persistence. The behavioral change survives across every request without being restated. But persistence cuts both ways. A fine-tuned model is a new artifact to version, secure, and maintain. The NIST AI Risk Management Framework now names fine-tuning poisoning as a distinct threat category alongside prompt injection. For public sector buyers, that means fine-tuning creates an attack surface that prompting does not.
Fine-tuning works like building a custom SAML attribute mapping for a specific relying party: configure it once, it persists across sessions, it changes how the system behaves for that integration. Both require a specific justification because both create maintenance burden. Where the analogy breaks: a SAML mapping is configuration you can roll back cleanly. A fine-tuned model's weight changes are permanent to that copy, and debugging unexpected behavior is harder than reading an attribute statement.
Prompting and Context Engineering
What it is: Sending instructions, examples, and reference data to the model at inference time without modifying its weights.
What it does: Shapes the model's behavior for a single request or conversation. This includes system prompts (persistent instructions for a session), few-shot examples (showing the model what good output looks like), retrieval-augmented generation (pulling relevant documents into the context window at query time), and tool use (giving the model access to external functions). Context windows grew from roughly 2,000 tokens in GPT-3 to over 1 million tokens in current frontier models, and that expansion is what ate most of fine-tuning's territory. GPT-4.1 handles 1,047,576 tokens. Google's Gemini 3 and Anthropic's Claude both offer 1-million-token windows. At the 1M mark, that's roughly 1,500 pages of text in a single prompt. Meta's Llama 4 Scout pushes to 10 million tokens. Knowledge that used to require fine-tuning to inject can now be dropped directly into the context window.
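The 1,500-page figure is a rule-of-thumb conversion, and it helps to know what sits behind it. The sketch below assumes about 0.75 English words per token and about 500 words per page. Both ratios are assumptions, not provider specifications, and real documents vary with language and formatting.

```python
# Assumptions (rules of thumb, not provider specifications).
WORDS_PER_TOKEN = 0.75  # typical for English prose
WORDS_PER_PAGE = 500    # a dense, single-spaced page


def tokens_to_pages(tokens: int) -> float:
    """Approximate page count that fits in a context window of the given size."""
    return tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE


# Window sizes quoted above.
windows = {
    "1M-token window": 1_000_000,
    "GPT-4.1": 1_047_576,
    "Llama 4 Scout": 10_000_000,
}

for name, tokens in windows.items():
    print(f"{name}: ~{tokens_to_pages(tokens):,.0f} pages")
```

If a buyer's policy library is formatted more loosely, lower the words-per-page assumption and the page count goes up.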
Who's behind it: Everyone. Every model provider supports it. OpenAI's optimization hierarchy puts prompting second (after writing evaluation criteria) and fine-tuning third, explicitly positioning prompting as the approach you should exhaust before escalating. Anthropic's guidance mirrors this structure: establish success criteria and build evaluations first.
What makes it distinct: Zero persistence, maximum flexibility. Nothing changes in the model. You can update behavior instantly by editing the prompt. The tradeoff is per-request cost: instructions and examples consume tokens on every call. But prompt caching has compressed this gap significantly. A cached system prompt with 20 examples at moderate volume costs under a dollar per day and, per Anthropic's published analysis, often matches fine-tuning performance on small datasets.
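A rough way to sanity-check the caching claim for a specific buyer is to price the repeated prompt prefix with and without a cache discount. Every number below is a hypothetical placeholder, including the discount. Real providers publish their own cached-read pricing, cache-write surcharges, and expiry rules, and this sketch ignores the last two.

```python
def daily_prefix_cost(requests_per_day: int, prefix_tokens: int,
                      usd_per_million_input: float,
                      cached_read_discount: float = 0.0) -> float:
    """Daily cost of sending the same prompt prefix on every request.

    cached_read_discount is the fraction taken off the input price when the
    prefix is served from cache (0.0 means no caching).
    """
    effective_price = usd_per_million_input * (1 - cached_read_discount)
    return requests_per_day * prefix_tokens * effective_price / 1_000_000


# Hypothetical placeholders: replace with the buyer's volume and provider pricing.
REQUESTS_PER_DAY = 1_000
PREFIX_TOKENS = 5_000          # system prompt plus worked examples
USD_PER_MILLION_INPUT = 1.00
CACHED_READ_DISCOUNT = 0.90

uncached = daily_prefix_cost(REQUESTS_PER_DAY, PREFIX_TOKENS, USD_PER_MILLION_INPUT)
cached = daily_prefix_cost(REQUESTS_PER_DAY, PREFIX_TOKENS, USD_PER_MILLION_INPUT,
                           CACHED_READ_DISCOUNT)

print(f"Without caching: ${uncached:.2f} per day")
print(f"With caching:    ${cached:.2f} per day")
```

Only the repeated prefix benefits. The user's actual query and the model's output are billed normally, so the gap narrows when requests are long relative to the instructions.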
Prompting works like scopes in an OAuth token: you define what the model should do at request time, and the system responds within those constraints. Both are runtime controls, both are stateless across requests, both let you change behavior without touching the underlying system. Where it breaks: OAuth scopes are binary (granted or denied). Prompt instructions are probabilistic. The model might not follow them perfectly every time, which is why format consistency above 99% sometimes still requires fine-tuning.
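Because prompt instructions are probabilistic, "format consistency" is something you measure, not assume. A minimal sketch of that measurement follows, assuming the task expects a JSON object with two required keys. The schema and the sample outputs are hypothetical, and in practice the outputs would come from a real evaluation run.

```python
import json

REQUIRED_KEYS = {"label", "confidence"}  # hypothetical output schema


def is_well_formed(raw: str) -> bool:
    """True if the output is a JSON object containing every required key."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)


def format_consistency(outputs: list[str]) -> float:
    """Fraction of outputs that match the expected format."""
    if not outputs:
        raise ValueError("need at least one output to measure")
    return sum(is_well_formed(o) for o in outputs) / len(outputs)


# Hypothetical model outputs for illustration only.
sample_outputs = [
    '{"label": "allow", "confidence": 0.97}',
    '{"label": "block", "confidence": 0.88}',
    'Sure! Here is the JSON you asked for: {"label": "allow"}',  # chatty preamble
    '{"label": "allow"}',                                         # missing key
]

rate = format_consistency(sample_outputs)
print(f"Format consistency: {rate:.1%}")
```

A number like this, taken over a few thousand real requests, is what turns "prompting isn't reliable enough" from an opinion into a measurable gap.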
Where Each Approach Wins and Where It Doesn't
I'm using trait-led analysis here because buyer conversations almost always come down to one of three dimensions: cost, persistence, or time to deploy. Mapping each approach against all three gives you the fastest path to a useful answer in the room.
| Dimension | Training from Scratch | Fine-Tuning | Prompting / Context Engineering |
|---|---|---|---|
| Upfront cost | $3M–$100M+ (GPU clusters, months of compute; BloombergGPT's $3–10M is the low end) | Hundreds to thousands for small API jobs; enterprise engagements through Bedrock or managed programs run significantly higher | Near zero (engineering time only) |
| Per-request cost | Depends on hosting infrastructure you own or lease | Lower per call (behavior baked into weights means shorter prompts, fewer tokens) | Higher per call (full instructions sent each time, though caching reduces this substantially) |
| Persistence | Permanent. It is the model | Permanent to that model copy. Survives across all requests | None. Must be re-sent or cached per session |
| Time to deploy | Months to years | Hours to weeks, depending on dataset and iteration | Minutes to days |
| Time to change behavior | Retrain (months) | Retrain on new data (hours to weeks) | Edit the prompt (minutes) |
| When it's the right call | You are a frontier lab, or you need a model that doesn't exist yet | High-volume, narrow task where format consistency or per-call token savings at scale justify the investment | Everything else. Prove this doesn't work before moving up the spectrum |
The last row is the one your buyer will test you on. OpenAI's optimization guide documents this as a hierarchy, not a menu. Write evals first. Optimize prompts second. Fine-tune only if a specific, measurable gap remains. OpenAI's decision to wind down self-serve fine-tuning turned that hierarchy from advice into architecture.
Fine-tuning still wins in a specific, narrow scenario: when you need near-perfect format consistency on a high-volume, stable task where per-call token savings justify the upfront investment. Anthropic's content moderation result (81.5% → 99.6% accuracy, 85% token reduction) is the cleanest published example. The conditions are worth noting: high volume, narrow task, stable requirements, measurable accuracy target. Remove any one of those and prompting is almost certainly the better starting point.
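The hierarchy and the four conditions above can be written down as a short checklist. This is one way to encode the reasoning for a discovery call, not a tool from OpenAI or Anthropic. The field names and recommendation strings are illustrative.

```python
from dataclasses import dataclass

WRITE_EVALS = "write evals first"
OPTIMIZE_PROMPTS = "optimize prompts and retrieval"
STAY_WITH_PROMPTING = "stay with prompting"
CONSIDER_FINE_TUNING = "fine-tuning is worth evaluating"


@dataclass
class UseCase:
    has_evals: bool            # success criteria and evaluations exist
    prompting_exhausted: bool  # prompts, examples, and retrieval have been tuned
    measurable_gap: bool       # a specific accuracy or format target is still missed
    high_volume: bool
    narrow_task: bool
    stable_requirements: bool


def recommend(use_case: UseCase) -> str:
    """Walk the evals, then prompts, then fine-tuning hierarchy in order."""
    if not use_case.has_evals:
        return WRITE_EVALS
    if not use_case.prompting_exhausted:
        return OPTIMIZE_PROMPTS
    # Remove any one of the four conditions and prompting remains the starting point.
    conditions = (use_case.measurable_gap, use_case.high_volume,
                  use_case.narrow_task, use_case.stable_requirements)
    return CONSIDER_FINE_TUNING if all(conditions) else STAY_WITH_PROMPTING


moderation = UseCase(has_evals=True, prompting_exhausted=True, measurable_gap=True,
                     high_volume=True, narrow_task=True, stable_requirements=True)
pilot = UseCase(has_evals=True, prompting_exhausted=True, measurable_gap=True,
                high_volume=False, narrow_task=True, stable_requirements=True)

print(recommend(moderation))
print(recommend(pilot))
```

The ordering carries the point: the function never reaches the fine-tuning branch unless evals exist and prompting has already been tried.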
For public sector buyers specifically: NIST's draft COSAiS framework treats "fine-tuning predictive AI" as a distinct deployment category from "using and adapting generative AI assistants," each with its own anticipated control overlay. Agencies that fine-tune models on sensitive data carry a different risk profile than those using prompting alone. That distinction will surface in AI risk assessments, and it's worth raising with CISOs who may not have considered it.
OpenAI's "evals → prompts → fine-tuning" hierarchy maps to how experienced IDAM architects approach integrations: start with OIDC and SCIM, escalate to custom SAML only when you have a specific requirement the standard can't handle, build a custom connector as a last resort. Where the analogy breaks: in IDAM, the custom approach is more work but equally reliable. In AI, the simpler approach (prompting) is sometimes more reliable than fine-tuning, because fine-tuning can introduce unexpected behaviors that are hard to diagnose.
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "You should fine-tune the model on your data" | "Current best practice is to start with prompting and retrieval, and only fine-tune if you hit a specific performance gap you can measure" | Matches OpenAI's documented hierarchy and avoids recommending unnecessary complexity |
| "Training and fine-tuning are basically the same thing" | "Training builds a model from scratch, and at the frontier that's hundreds of millions of dollars. Fine-tuning adapts an existing model, usually for thousands to tens of thousands, though large managed engagements run higher. They're different by orders of magnitude" | Buyers conflate these constantly; separating them shows you understand the economics |
| "Fine-tuning is the way to customize AI" | "Fine-tuning changes the model's weights permanently. Prompting shapes behavior at request time. Most customization today starts with prompting because context windows are large enough to hold the instructions that used to require fine-tuning" | Positions you as current on the state of practice |
| "We need to train a model for this use case" | "Are you looking to adapt an existing model's behavior, or do you need capabilities no current model has? Those are very different projects" | Reframes an imprecise statement into a decision the buyer can actually make |
| "OpenAI fine-tuning is the standard approach" | "OpenAI shut down self-serve fine-tuning to new users in May 2026. They're steering developers toward prompting, retrieval, and tool use for most customization" | Shows you're tracking current developments, not repeating 2023 conventional wisdom |
| "Fine-tuning is dead" | "Fine-tuning still makes sense for high-volume, narrow tasks where you need format consistency above 99% and the per-call token savings justify the investment. That's a narrower set of use cases than most people assume" | Avoids overcorrecting; shows nuanced understanding |
| "The model needs to be trained on your policies" | "Your policy documents can go directly into the context window now. Frontier models handle over a million tokens — that's roughly 1,500 pages of text in a single prompt" | Gives the buyer a concrete sense of scale that reframes the customization question |
| "AI customization is a security risk" | "NIST names fine-tuning poisoning as a specific threat category. Prompting doesn't modify the model, so it carries a different risk profile. That distinction matters for your AI risk assessment" | Connects customization choices to a governance framework the buyer recognizes |
| "We should build our own model" | "Bloomberg spent $3 to $10 million training a 50-billion-parameter financial model. GPT-4 outperformed it on most financial benchmarks within a year. Training from scratch is usually a depreciating asset" | Concrete cautionary example without editorializing |
| "Context windows solve everything" | "Long context windows eliminated the 'knowledge doesn't fit' argument for fine-tuning. The consistency and cost-per-call arguments still stand. It depends on volume and how tight your format requirements are" | Shows you understand the nuance, not just the trend |
The optimization hierarchy (evals, then prompts, then fine-tuning only with justification) is the single most useful framework for these conversations. If a buyer is reaching for fine-tuning, ask them: what specific, measurable gap did prompting leave? If they can't answer that, they're solving a problem that probably doesn't require weight modification. And as of May 2026, OpenAI agrees enough to have closed self-serve fine-tuning to new users.
Things to follow up on...
- OpenAI's fine-tuning shutdown timeline: Existing customers retain access through January 2027, but the developer community thread tracking the wind-down is worth monitoring for changes to the cutoff date and what replaces self-serve access.
- NIST COSAiS control overlays: The draft framework defines separate deployment categories for fine-tuning and prompting-based AI, with distinct control overlays anticipated in late 2026 to 2027 that will shape how agencies evaluate model customization risk.
- Context window cost at scale: Even with million-token windows available, processing long contexts carries real per-call expense, and the FinOps Foundation's 2026 report documents how context-window creep is driving enterprise AI bills higher even as per-token prices drop.
- Prompt caching as fine-tuning substitute: Anthropic's published analysis suggests a cached system prompt with examples often matches fine-tuning performance on small datasets at a fraction of the cost, a pattern worth tracking as caching features expand across providers.

