Training, Fine-Tuning, and Prompting

By Leigh Garrity— May 9, 2026

Training, fine-tuning, and prompting are three distinct points in a model's lifecycle where someone makes a decision about what the model knows and how it behaves. Federal buyers invoke all three, often interchangeably, and usually in the same breath as a data security concern. The AE who can distinguish them precisely, not just define them but explain what problem each one actually solves, earns a different kind of credibility in the room. Not vendor credibility. Peer credibility. That's what this piece is for.

Training

What it is: Building a model from scratch by exposing a neural network to massive volumes of text and other data, then adjusting billions of parameters until the model can predict patterns reliably across domains.

What it does: Produces the base model — the artifact that knows how to generate coherent language, reason across problems, and follow instructions. Everything downstream operates on top of what training established. When a buyer says "GPT-4" or "Claude" or "Llama 3," they're naming the output of a training run. Training is where the model's fundamental capabilities come from. Fine-tuning and prompting can shape those capabilities; they can't create them.

Who's behind it: OpenAI, Anthropic, Google DeepMind, Meta AI, Mistral, and a small number of other well-capitalized labs. The US government has begun funding compute access through the National AI Research Resource (NAIRR), but no federal civilian agency is training frontier models internally. This is not a gap that will close soon.

What makes it distinct: The cost, which functions as a near-absolute barrier. Training GPT-4-scale models runs into the hundreds of millions of dollars in compute alone — published estimates range from $78M to over $100M for the training run itself, before data acquisition, human feedback, and safety evaluation. Meta's Llama 3 70B reportedly required approximately 6.4 million GPU-hours. The number that matters: this is not a procurement decision. It's a capital expenditure that only makes sense if you're building a product on top of the model.

Cost anchor: $50M–$500M+ for frontier-scale training. Even modest research pre-training runs cost $1M–$10M.

Data exposure: Maximum, and largely irreversible. Training ingests everything — web crawls, licensed datasets, proprietary corpora. The model's weights encode statistical patterns from all of it. There's no clean way to audit what the model "learned" from a specific document, which is why training data provenance is a live legal and policy issue. If a buyer asks whether a vendor's base model was trained on their data, the honest answer is: probably some version of it, and there's no mechanism to remove it.

Fine-Tuning

What it is: Taking a pre-trained model and continuing to train it on a smaller, domain-specific dataset to shift its behavior, style, or knowledge in a targeted direction.

What it does: Adjusts the model's weights based on examples you provide. Fine-tune on 10,000 examples of correctly formatted acquisition documents, and the model gets better at producing correctly formatted acquisition documents. Fine-tune on your agency's incident reports, and the model learns your terminology and writing style. What it doesn't do reliably: memorize specific facts. Fine-tuning teaches patterns, not encyclopedias. A model fine-tuned on classified threat assessments will write in the style of classified threat assessments. It will not reliably recall the specific threat actor TTPs from document 4,217.

Who's behind it: Fine-tuning infrastructure is offered by OpenAI (fine-tuning API), Google (Vertex AI), AWS (Bedrock), and Azure (AI Studio). Open-source tooling — LoRA, QLoRA, Axolotl — lets teams with their own compute run fine-tuning without a managed API. The technique predates the current LLM era by a decade; the term comes from the academic ML community.

What makes it distinct: It's the only approach that permanently modifies the model's weights based on your data. That's both the appeal and the problem. The weights change; the change persists; and the model you end up with is a different artifact than the one you started with. You now own something that requires maintenance, evaluation, and eventual retraining when the base model updates.

Cost anchor: Highly variable. A modest fine-tune via managed API costs $10K–$30K in compute. A serious enterprise project — data preparation, evaluation, iteration, deployment — runs $100K–$1M. Fine-tuning a 70B open-weight model on your own infrastructure requires significant GPU resources and ML engineering time. The "thousands to low millions" range is accurate; the low end is deceptively accessible.

Data exposure: This is the crux for federal accounts. Fine-tuning via a managed API means your training data goes to the vendor's infrastructure. Enterprise agreements include data handling commitments, but the data leaves your environment. Fine-tuning an open-weight model on your own compute keeps the data in your environment — but now you need ML engineering capability. The classified data concern is real. Fine-tuning is not the solution to it. Deployment architecture is.

“

Okta Callout — The Provisioning Parallel Fine-tuning a model is structurally similar to provisioning a user with embedded permissions rather than evaluating policy at runtime. The model's behavior is baked in at a point in time, based on the data it saw during the fine-tune. Like a user whose access was provisioned once and never reviewed, a fine-tuned model doesn't automatically reflect changes in your data after training completes. When a buyer says "we fine-tuned it on our policy documents," the right question is: when? The model doesn't know about the policy revision from last quarter. This should trigger the same governance reflex you'd apply to stale entitlements.

Prompting

What it is: Sending instructions, context, and examples to a model at inference time — the moment the model generates a response — without modifying the model's weights.

What it does: Shapes the model's output by controlling what it sees when it generates a response. A well-constructed prompt can specify persona, output format, behavioral constraints, examples of desired behavior, and relevant context. System prompts — instructions the model receives before the user's input — are the mechanism most enterprise deployments use to establish consistent behavior across sessions. Retrieval-augmented generation (RAG) extends this further: relevant documents are retrieved from your data store and injected into the prompt at query time, giving the model access to current, specific information without touching its weights.

Who's behind it: Everyone. Prompting requires no special infrastructure, no ML expertise, and no vendor relationship beyond API access. The iteration cycle is hours, not weeks. If the behavior is wrong, you fix the prompt.

What makes it distinct: Cost and reversibility, in combination. A typical API call to a frontier model costs fractions of a cent. A complex prompt with a large context window might cost a few cents. At enterprise scale, monthly costs can reach tens of thousands of dollars — but that's operational cost, not capital expenditure, and it scales with usage. Context windows have also grown dramatically. Models that handled 8,000 tokens two years ago now handle 128,000 or more. Google's Gemini 1.5 Pro ships with a 1-million-token context window. That's roughly 750,000 words of documents, policies, and examples that can be injected into a prompt without modifying the model at all. That growth is what's eaten most of fine-tuning's territory.

Cost anchor: Fractions of a cent to a few dollars per query. Operational cost, not capital expenditure.

Data exposure: The content of your prompt goes to the model at inference time. For managed APIs, this means query content goes to the vendor's inference infrastructure — enterprise agreements typically include commitments that query data isn't used for training, but the data does leave your environment for the duration of the request. For on-premises or air-gapped deployments of open-weight models, prompts stay in your environment entirely. RAG specifically keeps your retrieval corpus in your infrastructure; only the retrieved excerpt travels to the model.

“

Okta Callout — Context vs. Context "Context" means something specific in IDAM (the attributes and session state that inform an access decision) and something different in LLMs (the text the model can "see" when generating a response, measured in tokens). When a buyer says "we need to give the model more context about our environment," they may mean either. The IDAM version is solved by policy engines and identity providers. The LLM version is solved by prompt construction and RAG. The two are converging in agentic architectures, where an AI agent's context includes both the conversation history and the identity claims that determine what it's allowed to retrieve. That's where your expertise becomes directly relevant.

The Comparison That Actually Matters

Structure: trait-led analysis, anchored on the four dimensions that drive buyer decisions — cost and accessibility, data control, time-to-value, and fit to the actual problem.

Cost and accessibility create a hard hierarchy. Training is for labs. Fine-tuning is for teams with a specific, well-examined justification and budget to match. Prompting is for everyone, immediately. The hierarchy runs on access, not quality. The most sophisticated production deployments at major enterprises run on prompting and retrieval, not fine-tuning.

Data control is where federal buyers get tangled. All three approaches involve data exposure of some kind, but at different points and with different implications. Training data is baked into weights permanently. Fine-tuning data goes to training infrastructure, vendor or your own. Prompt data goes to inference infrastructure at query time. The common assumption that fine-tuning is "more private" than prompting is wrong in the managed API case: both send data to the vendor, but fine-tuning sends it during training (where it modifies the model) rather than at inference (where it doesn't). If the real concern is data residency, the answer is where the model runs, not how it was trained.

Time-to-value is where fine-tuning's cost becomes most visible. A fine-tuning project requires data preparation (often the longest phase), training runs, evaluation, and deployment, a cycle measured in weeks to months. A prompting iteration takes hours. For most enterprise use cases, the faster cycle wins on business grounds alone, independent of cost.

Fit to the actual problem is the thesis. Fine-tuning is the right tool in a narrow set of circumstances: when you need the model to adopt a highly specific style or format that can't be achieved through prompting, when you're working with a small model that needs capability uplift, or when inference latency is constrained enough that you can't afford a large context window. Outside those conditions, the combination of better prompts, RAG, and longer context windows handles what fine-tuning used to be the only answer for, and handles it with less cost, less data exposure, and faster iteration. A 2024 survey by Stanford HAI found that 71% of enterprise AI teams that initially planned fine-tuning had shifted to prompt engineering or RAG within 12 months, citing maintenance burden and context window improvements as the primary reasons.

The "we need to fine-tune" reflex in federal accounts usually reflects one of three concerns: the model needs to know our information, we can't expose our data, or we want the model to behave consistently with our policies. Fine-tuning addresses none of these cleanly. Factual recall is better served by RAG. Data exposure is a deployment architecture question. Behavioral consistency is better achieved through system prompts and guardrails. Fine-tuning is a plausible-sounding answer to all three that actually solves none of them reliably.

“

Okta Callout — The Architecture Before the Model The question federal buyers are often really asking when they invoke fine-tuning is: how do we control what this model can access and do? That's an identity and access question wearing an AI costume. The governance layer — who can query the model, what data the model can retrieve, what actions the model can take — is where identity infrastructure is directly relevant. A fine-tuned model with no access controls is less secure than a prompted model with a well-governed retrieval layer. The model customization conversation and the access governance conversation need to happen in the same room.

How to Say This in the Field

Calibrated to the "we need to fine-tune on our classified data" reflex and adjacent scenarios.

Don't say	Do say	Why it matters
"You'll need to fine-tune on your classified data to make this useful."	"Let's separate two questions: where the model runs, and what it knows. Fine-tuning doesn't solve either one cleanly."	Reframes the conversation before the buyer commits to the wrong architecture
"Fine-tuning will teach the model your agency's policies."	"Fine-tuning teaches style and format. For factual recall of specific policies, retrieval is more reliable — and it stays current when your policies change."	Prevents a deployment that fails when the policy updates
"We can't use this model without fine-tuning on our data."	"What outcome are you trying to get? Most of what fine-tuning used to be the only answer for, you can now get from a well-constructed prompt and a retrieval layer."	Opens the door to a faster, cheaper path without dismissing the concern
"Fine-tuning keeps our data private."	"Fine-tuning via a managed API sends your training data to the vendor. If data residency is the concern, the answer is where the model runs, not how it was trained."	Corrects a common and consequential misconception
"We fine-tuned it, so it knows our classified information."	"Fine-tuning changes how the model behaves, not what it reliably knows. It can still hallucinate the classified facts you trained it on."	Sets accurate expectations before a demo goes wrong
"Training a model on our data would give us full control."	"Training from scratch costs hundreds of millions of dollars and takes a year. That's a research program, not a deployment option."	Closes a conversation that was never going anywhere useful
"Prompting is just for demos."	"Prompting is how production enterprise deployments work. The model you're evaluating was almost certainly shaped by a system prompt, not a fine-tune."	Corrects the assumption that prompting is a prototype-only approach
"We need the model to stay current with our data."	"Fine-tuning doesn't do that — it's a snapshot. RAG retrieves from your live data at query time. That's the architecture for currency."	Points to the right tool for a real requirement
"Fine-tuning is more secure than prompting."	"Security is a deployment question. A fine-tuned model on a vendor's infrastructure isn't more secure than a prompted model running in your FedRAMP-authorized environment."	Prevents security theater from driving the architecture decision
"We should wait until we can fine-tune before piloting."	"You can run a meaningful pilot today with prompting and retrieval. Fine-tuning is a later optimization, not a prerequisite."	Unblocks a pilot that's stalled on the wrong dependency

Next in Under the Hood: Inference infrastructure — where models actually run, what it costs to run them, and what "on-premises deployment" means in practice for a federal account.