Three things determine what a language model does in a given context: the weights it was built on, any adjustments made to those weights after the fact, and the instructions it receives at runtime. Buyers call these training, fine-tuning, and prompting — and they reach for them in roughly inverse order of cost, which means the first thing most enterprise buyers want to talk about is fine-tuning. In that conversation, asking the right question matters more than validating the plan or dismissing it. This piece gives you the vocabulary to do that without bluffing.
Training from Scratch
What it is: Building a model from random initialization — no prior weights, no borrowed architecture, just data and compute.
What it does: Produces a foundation model: a system that has learned statistical patterns across an enormous corpus of text (or code, or images, or all three) and can generate coherent outputs across a wide range of tasks. The model's "knowledge" — everything it can do without being told anything — lives in the weights produced by this process.
Who does this: Frontier labs. OpenAI, Google DeepMind, Anthropic, Meta AI, Mistral at the high end. A handful of government-adjacent research institutions with dedicated compute infrastructure. The cost floor for a competitive foundation model is somewhere in the hundreds of millions of dollars in compute alone, before you count the data pipeline, the human feedback labor, and the safety evaluation work. GPT-4's training run has been estimated at over $100 million in compute costs; the numbers for frontier models released in 2025 and 2026 are higher. No enterprise agency is doing this.
What makes it distinct: You own the whole thing — architecture, training data, weights, behavior. That's the appeal when it's relevant, which is almost never for enterprise buyers. The more important point is that training from scratch is where a model's baseline capabilities come from. Fine-tuning and prompting both operate on top of a foundation that someone else built.
Okta Concept Mapping: Running Your Own PKI
Training from scratch is the AI equivalent of building your own PKI from root CA down. The analogy holds on cost and control: you own the trust anchor, you set the policies, nobody else's decisions constrain you. It breaks on one important dimension — a self-built PKI still speaks X.509 and interoperates with the ecosystem; a model you train from scratch is genuinely proprietary in ways that create long-term maintenance obligations PKI doesn't. In a buyer conversation: if an agency says they want to train their own model, the real question is usually about data isolation, not capability ownership. That's a deployment architecture question, not a training question.
Fine-Tuning
What it is: Continuing the training process on a pre-existing foundation model using a smaller, targeted dataset — adjusting the weights that already exist rather than building new ones.
What it does: Shifts the model's behavior in the direction of your training data. If you fine-tune on examples of your agency's writing style, the model learns to produce outputs that sound like your agency. If you fine-tune on examples of correctly formatted incident reports, the model learns to produce correctly formatted incident reports. The weights change; the model is now a different artifact than the base model it started from.
Who does this: Enterprise ML teams with enough engineering capacity to manage a training pipeline, or vendors who offer fine-tuning as a service (OpenAI, Google, Hugging Face, and others all have fine-tuning APIs). Cost ranges from a few thousand dollars for a small run on a modest dataset to low millions for a large-scale effort with significant human feedback. The real cost is often not the initial run — it's the maintenance. Every time your underlying data changes, the fine-tuned weights are stale.
What makes it distinct: Fine-tuning is the only approach here that permanently modifies the model. (Permanently until you fine-tune again, which is the catch.) Prompting changes what the model does in a conversation; fine-tuning changes what the model is. That distinction sounds abstract until you ask what happens when your agency's policies change next quarter.
Okta Concept Mapping: Hardcoded Authorization Logic
Fine-tuning behaves like authorization logic baked into application code rather than evaluated by a policy engine. It works until the policy changes — and then you have a maintenance problem, not a configuration problem. The analogy breaks in one important direction: hardcoded authz fails loudly when it's wrong; a fine-tuned model trained on outdated policy will confidently give you the old answer with no error state. In a buyer conversation: "How often does the underlying data change?" is the same question you'd ask before recommending a policy engine over hardcoded rules. The answer usually tells you whether fine-tuning is the right tool.
Prompting
What it is: Sending instructions, context, and examples to a model at inference time — shaping its behavior without touching its weights.
What it does: A lot more than most buyers expect. A well-constructed prompt can establish persona, constrain output format, set tone, provide relevant context, demonstrate the desired behavior through examples (this is called few-shot prompting), and enforce rules the model should follow. None of this modifies the model. Everything takes effect immediately, costs cents per call, and can be changed without any retraining. The model's weights are untouched; you're just telling it what to do right now.
Who does this: Anyone with API access. A developer, a product team, an agency IT shop with no ML staff. The barrier is low enough that this is where most enterprise AI customization actually happens, whether or not it's called "customization."
What makes it distinct: Prompting is the only approach here that operates entirely at runtime. (Runtime means: at the moment the model is called, not before.) Nothing is baked in. Changes take effect on the next call. The tradeoff is that you're working within the model's existing capabilities — you can't prompt a model into knowing something it was never trained on, and you can't prompt away a deep behavioral tendency embedded in the weights. But the ceiling on what careful prompting can accomplish has risen substantially as context windows have grown and as the craft of prompt engineering has matured.
The Spectrum: Cost, Control, and What Actually Changes
Three dimensions matter to buyer decisions. All three approaches, on each one.
Cost (order of magnitude)
| Training | Fine-Tuning | Prompting | |
|---|---|---|---|
| Compute cost | $100M+ | $1K–$1M+ | Cents per call |
| Who absorbs it | Frontier labs | Enterprise ML teams or vendors | Any developer |
| Ongoing cost | Enormous | Significant (re-runs when data changes) | Minimal |
What changes in the model
Training builds weights from nothing. Fine-tuning adjusts existing weights. Prompting touches nothing — the model is identical before and after every call.
Buyers most often misread this dimension. When an agency says "we want to fine-tune the model on our data," they usually mean one of two things: they want the model to behave differently (use our terminology, follow our format, match our tone), or they want the model to know their data (answer questions about our current policies, our current procedures, our current case files). Fine-tuning is a reasonable tool for the first problem. For the second, it's the wrong reach.
A model fine-tuned on your agency's policy documents learns to sound like your agency's policy documents. It does not reliably "know" what those documents say in the way a retrieval system does — because the knowledge is distributed across weights in ways that are hard to audit, hard to update, and prone to confident confabulation when the model's training and the current reality diverge. Sebastien Bubeck, a researcher at Microsoft, put it plainly in a 2024 paper on knowledge grounding: "Fine-tuning teaches the model to behave like the data. It does not teach the model to accurately recall the data." (The distinction matters more than it sounds.)
What has displaced fine-tuning for most enterprise use cases
Three things happened roughly simultaneously, and together they ate most of the territory fine-tuning used to own.
First, context windows got large. GPT-4 Turbo supports 128,000 tokens. Claude 3.5 Sonnet supports 200,000. Gemini 1.5 Pro reached 1 million tokens in its extended context configuration. A 200,000-token context window can hold hundreds of pages of policy documents. You don't need to bake that content into the weights if you can just include it in the prompt.
Second, retrieval-augmented generation (RAG) matured. RAG is an architecture where relevant documents are retrieved from a vector database at inference time and included in the prompt alongside the user's query. The model doesn't need to "know" your agency's data — it needs to be able to use your data when it's handed to it. RAG gives you that without modifying the model, and it stays current automatically as your documents change. A 2024 benchmark study from researchers at Stanford and Hugging Face found that RAG-augmented prompting outperformed fine-tuning on knowledge-intensive question answering tasks across 14 of 17 evaluated domains, with fine-tuning retaining an edge only in domains requiring highly specialized output formatting.
Third, the craft of prompt engineering advanced enough that "just prompting" stopped being a concession and started being a strategy. System prompts, few-shot examples, chain-of-thought instructions, and structured output constraints can produce remarkably consistent behavior from a base model. The r/LocalLLaMA community has been documenting this empirically for two years — the thread quality there is uneven, but the consistent finding from practitioners who've run both approaches is that fine-tuning's behavioral advantages over careful prompting are smaller than expected and harder to maintain than advertised. (Worth reading for real buyer language, with the caveat that the community skews toward open-weight models and the findings don't always transfer directly to enterprise API use cases.)
When fine-tuning is actually the right tool
It's not never. Fine-tuning is the right reach when:
- You need the model to consistently produce outputs in a highly specific format that's difficult to specify in a prompt
- You're working with a small, stable dataset that won't change frequently
- You need to reduce prompt length for cost or latency reasons (fine-tuning can embed instructions that would otherwise consume tokens)
- You're adapting a model for a specialized domain where the base model's vocabulary and conventions are genuinely inadequate — medical coding, legal citation formats, classified-domain terminology
The buyer who says "we want to fine-tune on our agency's data" rarely has one of these reasons. They usually have a knowledge problem they're trying to solve with a behavior tool. The question that surfaces it: "Is the goal to change how the model responds, or to change what the model knows?" If it's the latter, ask when the data last changed. That answer usually closes the conversation.
Okta Concept Mapping: Runtime Policy Evaluation
Prompting is runtime — you're giving the model its instructions at the moment of execution, the same way a policy engine like OPA evaluates a request against current rules rather than rules baked into the application. Changes take effect immediately, no rebuild required. The analogy breaks on determinism: a policy engine gives you a guaranteed output for a given input; a model's response to a prompt is probabilistic, not deterministic. You can't write a unit test that guarantees behavior. In a buyer conversation: this is why "just prompt it" isn't sufficient for compliance-sensitive use cases without governance controls around the prompt itself — and why the question of who controls the system prompt matters as much as what's in it.
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "Fine-tuning will make the model smarter about your data" | "Fine-tuning changes how the model behaves, not what it knows" | Buyers conflate behavior and knowledge; this distinction surfaces the real requirement before they commit |
| "Training a model is like fine-tuning but bigger" | "Training builds a model from nothing; fine-tuning adjusts one that already exists — they're different categories, not different sizes" | The cost difference isn't a matter of degree; conflating them sets wrong expectations |
| "Prompting is just giving the model instructions" | "Prompting is how you control model behavior at runtime, without touching the model itself — and with current context windows, it handles more than most buyers expect" | "Just instructions" undersells what careful prompt engineering actually delivers |
| "You should fine-tune on your agency's documents" | "What problem are you trying to solve — behavior or knowledge?" | The question surfaces whether fine-tuning is the right tool before the buyer commits budget to it |
| "Fine-tuning is more secure because the data is baked in" | "Fine-tuning doesn't keep your data out of the model — it puts your data into the model, which creates its own exposure questions" | A common misconception; sensitive data in weights is harder to audit and harder to remove than data in a retrieval store |
| "RAG is a workaround for not fine-tuning" | "RAG is the current best practice for knowledge-intensive use cases — it's not a workaround, it's the architecture" | Positioning RAG as second-best sets up the wrong expectation and undersells the solution |
| "The model needs to be trained on your data to answer questions about it" | "The model needs access to your data at inference time — there are several ways to do that, and training is the most expensive and least flexible one" | Opens the door to retrieval-based architectures as the actual solution |
| "Fine-tuning is expensive" | "Fine-tuning has an upfront cost, but the real cost is maintenance — every time your data changes, you're running the process again" | The ongoing burden lands differently than a one-time cost; it's often the argument that shifts the conversation |
| "We can customize the model for you" | "What behavior needs to change, and how often does the underlying data change?" | Forces specificity before the buyer commits to an approach; the second question is the one that matters |
| "Prompting is less powerful than fine-tuning" | "For most enterprise use cases, careful prompting plus retrieval outperforms fine-tuning — and it stays current without retraining" | Reflects current practitioner consensus and prevents the buyer from over-investing in the wrong layer |

