Pre-training, fine-tuning, and inference are the three phases of a model's lifecycle, and they differ by roughly five orders of magnitude in cost. NIST AI 600-1uses exactly this vocabulary to describe the generative AI lifecycle, which means your public sector buyer's compliance team probably already has the terms in a document somewhere. You'll encounter them when a buyer says some version of "we want to train the model on our data." That sentence can mean three very different things, and the difference is the difference between a few hundred dollars, a few thousand, or a few hundred million. Knowing which regime the buyer actually needs is what keeps that conversation productive.
One term you'll see throughout: RAG (retrieval-augmented generation) is a pattern that runs during inference, not a separate regime. It's covered in Lesson 3. It matters here because "train it on our data" very often means RAG, and recognizing that saves everyone time and money.
Pre-Training
What it is: Building a model's weights from scratch by exposing it to massive datasets.
What it does: Pre-training creates the foundation model — the base model that everyone else builds on. Trillions of tokens of text, code, and other modalities get processed across thousands of specialized GPUs running for months. The output is a set of weights: billions of numerical parameters that encode the statistical relationships the model learned. Every capability the model ships with comes from this phase. Its grasp of language, its reasoning patterns, its knowledge of the world up to its training cutoff. The Stanford AI Index Report tracks published training cost estimates annually. Frontier model training runs now range from $100M to over $1B.
Who's behind it: Model vendors, exclusively. OpenAI, Anthropic, Google DeepMind, Meta, Mistral, and a small handful of others. The infrastructure requirements put this outside the reach of any enterprise buyer. This is R&D at industrial scale.
What makes it distinct: You will never sell a pre-training engagement. Your buyer will never fund one. Pre-training is what model vendors do before anyone else touches the model. When it comes up in conversation, it's almost always because someone used the word "train" when they meant something cheaper and faster.
Fine-Tuning
What it is: Adjusting an existing pre-trained model's weights using a smaller, domain-specific dataset so the model behaves differently.
What it does: Fine-tuning changes how the model responds: its tone, its output format, its conventions. It does not teach the model new facts. OpenAI's documentation draws this line explicitly: fine-tuning is for behavior. If the model needs current facts, product records, or a policy library that changes quarterly, the vendor's own guidance says to use retrieval.
You prepare a dataset of example inputs and desired outputs. OpenAI's fine-tuning guide sets the minimum at 10 examples, with meaningful improvement typically starting around 50-100. You upload the dataset, kick off a training job, and the job runs minutes to hours depending on size. The output is a new model variant accessible through the same API as the base model. Plan for 3-10 training runs to iterate toward the output quality you need.
Who's behind it: Enterprise teams working through vendor APIs. OpenAI, Azure OpenAI, and Google all offer fine-tuning as a service. Anthropic does not offer public fine-tuning through its standard API as of May 2026. Note that OpenAI's direct fine-tuning platform is winding down for new users, though existing users retain access and Azure OpenAI's fine-tuning product remains actively maintained. Azure OpenAI's cost management docs are the most detailed primary source for cost planning: training cost is calculated as training tokens × epochs × per-token rate (essentially: how much data × how many passes × the model's rate), plus an ongoing hosting fee of $1.70-$3.00/hour for the deployed model. That hosting fee runs whether the model is handling requests or sitting idle.
What makes it distinct: Fine-tuning changes the model's behavior without rebuilding it from scratch, and the data you train on becomes permanent. It gets baked into the weights. There's no database to update, no index to refresh. If your source data changes next quarter, you retrain or you live with stale behavior. That permanence is the defining characteristic of fine-tuning, and it's what makes the choice consequential.
All pricing figures are subject to change. Verify against current provider rate cards before using in a proposal.
Fine-tuning a model is analogous to writing custom authentication policies in your IdP — both take a general-purpose system and encode your organization's specific behavioral requirements. The analog breaks on reversibility: you can delete a conditional access policy and the IdP reverts to default. A fine-tuned model's weights are permanently altered. "Reverting" means redeploying the base model and losing the fine-tune entirely.
Inference
What it is: Running a trained model on live inputs to produce outputs, with the model's weights frozen.
What it does: Every time you send a prompt to a deployed model and get a response, that's inference. The model applies its frozen weights to your input and generates output token by token. You're billed per token, with input tokens (what you send) and output tokens (what you get back) priced separately. Current pricing spans a wide range: from $0.10 per million input tokens for lightweight models like GPT-4.1 Nano to $30 per million input tokens for premium reasoning models. Output tokens cost 3-12x more than input tokens depending on the model. Reasoning-intensive models add another billing layer: reasoning tokens are billed as output tokens even when they aren't shown in the response.
Who's behind it: Anyone with API access. Every chatbot, document summarizer, and code assistant in production is inference. The entire market of AI-powered applications runs here.
What makes it distinct: Inference is pure operational expense. No training runs, no weight changes, no data baked into anything. The model is a fixed asset and you're paying to use it. RAG happens at this layer: the retrieval step pulls relevant documents from an external store, stuffs them into the prompt as context, and the model generates a response using its frozen weights. The data stays external, updateable, and governable.
RAG at inference time works like an ABAC attribute lookup: the system queries an external data source at decision time rather than relying on static rules baked into the policy engine. The analog holds on architecture. It breaks on authorization: an ABAC lookup is governed by a PDP with explicit access policies. A basic RAG retrieval has no equivalent centralized authorization layer. That gap is where identity becomes relevant to AI architecture, but that's a later lesson.
Comparison by Buyer Scenario
I'm organizing this comparison around buyer scenarios rather than a flat trait table because the point of this piece is a conversation moment. The three regimes each answer a different question at a different altitude of the model lifecycle. Scenario mapping lets you recognize what the buyer is describing and know which regime it maps to, which is the skill that matters on Tuesday.
"We want to train it on our data."
One question resolves this: what data, and what do you want the model to do with it?
"Our policy documents, so the model can answer questions about them." That's inference with RAG. The data stays in a document store. The model retrieves relevant chunks at query time. Nothing is trained. Cost: per-token inference pricing plus the retrieval infrastructure. Timeline: days to weeks for the pipeline, then ongoing per-call costs.
"Our style guide and response templates, so the model writes like us." That's fine-tuning. You're encoding a behavior pattern into the weights. Cost: a realistic total project budget including engineering time for data prep and iteration runs is in the $500-$5K range for most use cases. Timeline: days to a few weeks including data prep and iteration.
"We want to build our own model from the ground up." That's pre-training, and unless your buyer is a national lab with a nine-figure compute budget, they don't mean it. Redirect.
"How much will this cost us?"
| Regime | Cost structure | Typical enterprise magnitude |
|---|---|---|
| Pre-training | Compute + data + research team + months of GPU time | $100M+ (what model vendors spend, not an enterprise line item) |
| Fine-tuning | Training tokens × epochs × per-token rate, plus hosting at $1.70-$3/hr ongoing, plus engineering time | $500-$5K for most projects; reinforcement fine-tuning runs higher (Azure caps RFT at $5K/run) |
| Inference | Per-token, per-call (input and output billed separately) | $0.10-$30 per million input tokens depending on model; scales linearly with usage |
The hidden cost in fine-tuning is hosting. Once you deploy a fine-tuned model, you pay for the deployment whether it's handling traffic or not. The hidden cost in inference is output and reasoning tokens, which run 3-12x more expensive than input tokens.
"How long until this is in production?"
Pre-training: months. You are not doing this.
Fine-tuning: the training job itself runs minutes to hours. The project, including data preparation, iteration, and evaluation, takes days to weeks.
Inference: immediate. You make an API call, you get a response. Deploying a RAG pipeline around inference takes days to weeks for the retrieval infrastructure, but the model is available now.
"What happens to our data?"
The "train it on our data" correction has real governance teeth here.
Fine-tuning bakes your data into the model's weights. NIST AI 600-1 flags data memorization as a specific risk of training: information included in training data can resurface at inference time. For public sector buyers handling sensitive data, fine-tuning is a data-governance event. There's no access control layer between the data and the model after training completes.
RAG at inference keeps data external. Documents live in a store you control, retrieved at query time, governed by whatever access controls surround that store. The data never enters the weights.
Pre-training data governance is the model vendor's problem, not yours. But it's worth knowing the risk exists: frontier models trained on public data have been shown to memorize and reproduce specific training examples.
Think of a model's weights as analogous to the trust store in a federation architecture. Pre-training builds the trust store from scratch. Fine-tuning adds or modifies specific trust relationships. Inference reads from the trust store but never writes to it. The analog holds on the read/write distinction. It breaks on auditability: modifying a trust store is a precise, traceable operation. Fine-tuning modifies weights in ways that are famously difficult to audit, which is exactly why NIST flags it as a risk surface.
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "You'd need to train a model for that." | "That sounds like a retrieval problem. Your data stays in a document store and the model pulls from it at query time." | Redirects a $50K conversation to a $5K one and shows you know the difference. |
| "Fine-tuning is like training the model on your data." | "Fine-tuning changes how the model behaves — its tone, its format. It doesn't teach it new facts." | Prevents the buyer from expecting a fine-tuned model to know their latest policy updates. |
| "We could train it on your policies." | "We could set up retrieval against your policy library so the model references current documents at query time." | Keeps the data updateable and avoids baking sensitive content into model weights. |
| "Pre-training would let you customize it completely." | "Pre-training is what OpenAI and Anthropic do. It costs north of $100M. What you're describing is fine-tuning or retrieval." | Resets cost expectations by five orders of magnitude. |
| "The AI will learn from your data over time." | "The model's weights are frozen at inference. It doesn't learn from conversations. New information goes into the retrieval index." | Corrects the most common misconception about deployed AI systems. |
| "Fine-tuning is expensive." | "A fine-tuning project typically runs a few thousand dollars including data prep. The ongoing cost is hosting the deployed model, which bills hourly whether it's active or not." | Gives the buyer a real number instead of a vague warning. |
| "Inference is basically free." | "Inference is per-token. A budget model runs under a dollar per million tokens. A premium model can hit $30 per million inputs. Model selection is your biggest cost lever." | Prevents sticker shock when the buyer picks a premium model and the invoice arrives. |
| "NIST doesn't really cover AI training." | "NIST AI 600-1 uses pre-training, fine-tuning, and inference as its standard lifecycle vocabulary. Your compliance team is probably already working from it." | Positions you as someone who's read the governance docs the buyer's team is reading. |
| "Let's fine-tune it on your customer data." | "Fine-tuning bakes data into the model permanently. NIST flags data memorization as a specific risk. For sensitive data, retrieval keeps it external and governable." | Turns a technical choice into a governance conversation the CISO will care about. |
| "We can customize the model for your use case." | "There are two ways to customize: fine-tuning changes the model's behavior, retrieval changes what information it can access. Which problem are you solving?" | Forces the buyer to articulate what they actually need, which is where the deal gets specific. |
The model's weights are being built, being adjusted, or being read. That's the entire taxonomy. Everything else is a question of which one the buyer actually needs.
Things to follow up on...
- OpenAI fine-tuning wind-down: OpenAI's model optimization page now states the fine-tuning platform is no longer accessible to new users, though existing users retain access and Azure OpenAI's product remains active — worth tracking if a buyer asks about fine-tuning availability.
- NIST AI 600-1 data memorization: NIST's generative AI profile explicitly flags data memorization as a training-phase risk, noting that sensitive information in training data can resurface at inference time — a fact that sharpens the fine-tuning vs. RAG governance conversation.
- Anthropic's flat long-context pricing: As of March 2026, Anthropic eliminated long-context surcharges on Opus 4.7 and Sonnet 4.6, meaning a 900K-token request costs the same per-token rate as a 9K-token request — a material shift for teams comparing inference costs across providers.
- Azure reinforcement fine-tuning caps: Azure OpenAI's cost management documentation caps reinforcement fine-tuning jobs at $5,000 per run, automatically pausing training and creating a deployable checkpoint when the limit is reached — a useful guardrail to mention when buyers worry about runaway training costs.

