You're past the prototype. The question on the table is how to run this in production, and the answer determines cost structure, latency profile, and maintenance burden for the life of the deployment. Two strategies, different economics, different operational profiles — and one capability that's widely misunderstood in ways that cost agencies real money.
Prompting and fine-tuning are production choices with different payoff profiles depending on what you're actually trying to optimize. Knowing the difference precisely — not approximately — is what separates a credible conversation from one that gets corrected by the agency's CTO in the second meeting.
The Subjects
Prompting as a production strategy is the use of carefully constructed input — system instructions, context, examples, constraints — to shape model behavior at inference time, without modifying the model's underlying parameters.
What it does in practice: you control behavior through what you put in the context window on every call. The model is unchanged; the instructions travel with each request. Lesson 1 covered context assembly, so the mechanics are familiar. What matters here is the production profile: prompting is runtime configuration. Change the prompt, change the behavior. No deployment cycle, no retraining, no infrastructure beyond what you're already running.
The strategy comes from the same place as the models themselves — OpenAI, Anthropic, Google, and the open-source community have all invested heavily in making frontier models prompt-responsive, because prompt-based customization is the path of least resistance for most enterprise deployments. The OpenAI system prompt specification, Anthropic's Constitutional AI guidance, and the emerging Model Spec documents are all, in part, attempts to make prompting more predictable at production scale.
What makes it distinct: zero upfront cost, maximum flexibility, and the ability to iterate in hours rather than weeks. The tradeoff is that every call carries the full weight of your instructions, which has direct implications for token cost and latency.
Fine-tuning is the process of continuing the training of a pre-trained model on a curated dataset, adjusting the model's weights to change its default behavior.
What it does in practice: you run a training job on examples of the behavior you want — input/output pairs, typically — and the resulting model has those behavioral patterns baked into its weights. The fine-tuned model responds differently from the base model without needing explicit instructions on every call.
The strategy originates in transfer learning, a technique that predates large language models by years. The core idea: a model trained on a massive general corpus has learned representations that can be adapted to specific tasks with far less data than training from scratch. OpenAI offered fine-tuning on GPT-3 as early as 2021; the open-source ecosystem (Hugging Face, EleutherAI, Meta's LLaMA releases) has made fine-tuning accessible to organizations that can't afford frontier model API costs at scale.
What makes it distinct: behavioral consistency without runtime instruction overhead, and the potential to run a smaller, cheaper model that performs comparably to a larger model on a narrow task. The tradeoff is upfront training cost, a retraining cycle for every behavioral update, and a capability that is genuinely different from what most buyers think it is.
Comparison
I'm using trait-led analysis across four dimensions: cost structure, latency impact, knowledge behavior, and maintenance burden. The alternative — scenario mapping — would bury the knowledge behavior point, which is the dimension where the most consequential misunderstanding lives. Each strategy appears on every dimension.
Cost Structure
Prompting: No upfront training cost. You pay per token at inference time, and cost scales directly with context length and call volume. A well-constructed system prompt for a government document processing task might run 800–1,200 tokens per call. At GPT-4o pricing (roughly $5 per million input tokens as of early 2026), that's manageable at moderate volume and expensive at scale. The economics are linear and predictable, which makes budgeting straightforward but makes high-volume deployments expensive.
Fine-tuning: Upfront training cost, then potentially lower per-inference cost. A LoRA fine-tuning run on a 7B-parameter model using a cloud provider's training infrastructure might cost $200–$800 for a reasonably sized dataset. The payoff: a fine-tuned smaller model can often match a larger prompted model on narrow tasks, at a fraction of the per-call cost. The economics work when call volume is high enough and the task is narrow enough. At low volume, the training cost never amortizes. At high volume with a well-scoped task, fine-tuning can cut inference costs by 60–80% compared to a frontier model with a long system prompt.
The crossover point is real but task-dependent. There's no universal answer; the calculation requires knowing your call volume, your task complexity, and your tolerance for the maintenance burden that comes with fine-tuning.
Latency Impact
Prompting: Latency scales with context length. A 1,500-token system prompt adds measurable processing time to every call. For interactive applications — a citizen-facing chatbot, an analyst assistant — this matters. For batch processing, it usually doesn't. The latency profile is consistent and predictable; you know what you're paying on every call.
Fine-tuning: Can reduce per-call latency by eliminating the need for long system prompts — the behavior is baked in, so the input is shorter. But if the fine-tuned model is smaller than the prompted frontier model it's replacing, you're trading prompt length for model capability, and the quality tradeoffs may introduce retry logic or downstream correction steps that add latency back. Fine-tuning doesn't automatically mean faster; it means different latency characteristics that require measurement against your specific task.
Knowledge Behavior
This is where the production decision gets misread most often, and where the misread is most expensive.
Prompting: The model's knowledge comes from pre-training. Prompting adds context — documents, data, instructions — that the model reasons over at inference time. The model doesn't "learn" from what you put in the prompt; it processes it. Retrieval-augmented generation (RAG) works on exactly this principle: retrieve relevant documents, include them in the context window, and the model reasons over them. The knowledge is in the context, not in the model. Remove the context, and the model doesn't remember it.
Fine-tuning: Fine-tuning adjusts how the model behaves — its style, its format, its tone, its task-specific patterns. It does not inject new factual knowledge into the model's weights in any reliable, auditable way. A model fine-tuned on your agency's policy documents will learn to respond in a style consistent with those documents. It will not become a reliable repository of their content.
Worth being precise about why. Pre-training encodes knowledge by exposing the model to billions of tokens of text, with the weight adjustments happening across the entire parameter space over many training epochs. Fine-tuning is a much smaller intervention — typically a few thousand to a few hundred thousand examples, adjusting a fraction of the parameters, for a fraction of the training time. The model's factual knowledge is largely fixed at pre-training. Fine-tuning shapes behavior on top of that foundation; it doesn't rebuild the foundation.
If your buyer wants the model to "know" their agency's current policies, fine-tuning is not the answer. RAG is the answer — retrieve the relevant policy documents at inference time and include them in the context. Fine-tuning is the answer if they want the model to respond in a specific format, maintain a consistent tone, follow a specific output structure, or perform a narrow task more reliably than the base model does by default.
Maintenance Burden
Prompting: Update the prompt, change the behavior. Iteration cycles are measured in hours. The risk is prompt drift — when a model provider updates the underlying model, prompt behavior can shift in ways that require re-testing and re-tuning. This is a real operational concern for production deployments, and it's underappreciated. You don't own the model; you own the prompt. Model updates are outside your control.
Fine-tuning: Every behavioral update requires a retraining cycle. When the base model is updated by the provider, your fine-tuning may need to be redone — the adapter weights trained on an older version of the model may not transfer cleanly to the new version. For organizations running on open-source models they control, this is manageable. For organizations fine-tuning on top of a provider's model, it introduces a dependency on the provider's release cadence. The maintenance burden is substantially higher than prompting, and the iteration cycle is measured in days to weeks, not hours.
PEFT and LoRA: Enough to Evaluate a Vendor Claim
Full fine-tuning adjusts all of a model's parameters. For a 70-billion-parameter model, that means updating 70 billion numbers — a compute-intensive, expensive process that requires significant GPU infrastructure.
Parameter-efficient fine-tuning (PEFT) is a family of techniques that achieve most of the behavioral benefit of fine-tuning by adjusting only a small fraction of the parameters. Most of the behavioral change you want can be approximated without touching the entire model.
LoRA (low-rank adaptation) is the most widely deployed PEFT technique. Instead of modifying the model's weight matrices directly, LoRA adds small adapter matrices alongside the existing weights. These adapters are trained to approximate the weight changes that full fine-tuning would produce, using far fewer parameters — typically less than 1% of the full model's parameter count. You get a fine-tuning run that costs a fraction of the full approach, produces a model that behaves similarly to a fully fine-tuned model on the target task, and can be swapped in and out without modifying the base model.
LoRA changes the economics of fine-tuning. The knowledge behavior point is unchanged. A LoRA adapter trained on your agency's documents still optimizes style and format, not factual knowledge injection. The efficiency gain is real and meaningful for cost calculations; the knowledge behavior limitation is not something LoRA addresses.
When a vendor says "fine-tuned on your data" or "custom model trained for your use case," the question to ask is: what behavioral outcome are you optimizing for? If the answer is "the model will know our data" — that's a RAG problem, not a fine-tuning problem, and the vendor's framing is either imprecise or misleading. If the answer is "the model will respond in a consistent format and style appropriate for our use case" — that's a fine-tuning problem, and LoRA is a legitimate, cost-effective path to it.
When the Frontier Model Wins
A well-prompted GPT-4-class model outperforms a fine-tuned smaller model when the task requires reasoning depth that the smaller model's base capability can't support.
Complex policy analysis, multi-document synthesis, novel scenario interpretation, tasks where the fine-tuning data didn't adequately cover the edge cases the model will encounter in production — these are all cases where the smaller model's ceiling becomes visible. Fine-tuning a 7B-parameter model on 10,000 examples of document summarization produces a model that summarizes documents in a consistent format. It does not produce a model that can reason about a novel regulatory question it hasn't seen before.
A fine-tuned smaller model earns its place when the task is narrow, well-defined, high-volume, and format-consistency matters more than reasoning depth. Classifying support tickets. Extracting structured data from a specific document type. Generating templated responses to common queries. These are tasks where the smaller model's limitations don't matter because the task doesn't require what the smaller model lacks.
The buyer who says "we need to train our own model" is often describing a task in the first category while imagining the economics of the second. Ask what success looks like on a hard case. If the answer requires reasoning, the frontier model with good prompting is probably the right starting point, and fine-tuning is a later optimization if inference cost becomes a constraint.
Field Language Guide
| Don't say | Do say | Why it matters |
|---|---|---|
| "Fine-tuning will teach the model your data" | "Fine-tuning shapes how the model responds — style, format, consistency. For the model to reason over your data, you need retrieval." | Buyers conflate fine-tuning with RAG; this redirects without correcting |
| "You need to train your own model" | "What outcome are you optimizing for — behavior consistency, inference cost, or knowledge access? Each has a different solution." | Unpacks the real requirement before committing to a technical path |
| "Fine-tuning is more accurate" | "Fine-tuning is more consistent on narrow tasks. Accuracy on knowledge-intensive tasks depends on what's in the context, not the weights." | Accuracy and consistency are different claims with different implications |
| "LoRA makes fine-tuning affordable" | "LoRA reduces training cost significantly. The knowledge behavior is the same as full fine-tuning — it's an economics improvement, not a capability change." | Prevents LoRA from being oversold as a knowledge injection solution |
| "A fine-tuned model will know your policies" | "A fine-tuned model will respond in a style consistent with your policies. For reliable policy retrieval, you want RAG with your policy documents as the source." | The distinction between behavioral style and factual knowledge is the central one |
| "Prompting doesn't scale" | "Prompting has a different cost profile at high volume — that's when fine-tuning economics start to make sense, if the task is narrow enough." | Avoids false dichotomy; makes the tradeoff concrete |
| "We'll fine-tune on your classified data" | "Fine-tuning on classified data requires an environment cleared for that data — that's an infrastructure question before it's a model question." | Surfaces the procurement and security constraint before the technical one |
| "GPT-4 is overkill for this" | "For this task, a smaller fine-tuned model may match GPT-4's output at lower cost — but that's worth validating on your hard cases before committing." | Keeps the frontier model as the benchmark, not the default to replace |
| "The model will learn from your feedback" | "RLHF and fine-tuning are offline training processes. The deployed model doesn't update from production interactions unless you run a new training cycle." | Prevents the buyer from expecting real-time learning |
| "We can update the fine-tuned model easily" | "Updating a fine-tuned model requires a retraining cycle — days to weeks, not hours. Prompting updates are faster to iterate." | Maintenance burden is a real procurement consideration |
IDAM Callout
The Policy Rule vs. The Compiled Configuration
Prompting behaves like a policy rule evaluated at runtime: interpretable, updatable without a deployment cycle, and applied fresh on every request. Fine-tuning behaves more like a compiled identity provider configuration — the behavior is baked into the artifact, faster to execute per call, but requiring a full deployment cycle to change. The analog holds reasonably well for cost and maintenance intuitions. Where it breaks: in IDAM, you can audit a policy rule and know exactly what it will do given a specific input. A fine-tuned model's behavioral changes are distributed across millions of weight adjustments — you can observe the outputs, but you can't inspect the mechanism the way you'd inspect a XACML policy. This matters in a buyer conversation when the CAIO asks "how do we audit what the model has learned?" The honest answer is that fine-tuning is less auditable than prompting, not more — which is the opposite of what most buyers assume when they imagine "training on our data" as a path to control.

