The Universal IdP You Can't Audit: Foundation Models and the Fine-Tuning Question

By Leigh Garrity— May 9, 2026

The Universal IdP You Can't Audit: Foundation Models and the Fine-Tuning Question

A foundation model is trained once, at enormous scale, and then adapted by everyone who uses it. That's the entire business model of the AI infrastructure layer — and it's why your customers can deploy AI without becoming AI labs.

Understanding that sentence precisely is worth more in a CIO conversation than anything else in this piece.

What a Foundation Model Actually Is

Start with the mechanism. A foundation model is a neural network trained on a massive corpus of text, code, documents, and web content to develop a general-purpose representation of language, knowledge, and reasoning. "Foundation" is doing real work in that name: this is the base layer. Everything your customers will do with AI sits on top of it.

The scale is the point. These models contain hundreds of billions of parameters, numerical weights that encode, in aggregate, what the model has learned. Training them requires compute budgets measured in tens of millions of dollars and months of runtime on specialized hardware clusters. No individual agency is going to do this. No mid-market enterprise is going to do this. The economics don't work, and the engineering talent required doesn't exist outside a handful of organizations.

What does exist at the agency level is the ability to adapt. The foundation model provider trains the base model. The agency configures it for their context — through prompts, through retrieval layers, through fine-tuning in specific circumstances. The training happens once, upstream. The adaptation happens at the tenant.

Train once, adapt everywhere. That's the structure your customers are buying into.

“

• Foundation model: A large neural network trained once at enormous scale to develop general-purpose language and reasoning capabilities. Organizations adapt it to specific tasks rather than training from scratch — the same way they configure shared infrastructure rather than building it.

The Adaptation Spectrum (and Where Fine-Tuning Actually Sits)

There are four ways to get a foundation model to do what you want. They're not equivalent, and the current consensus has shifted significantly on which ones you should reach for.

Prompting is the lightest touch. You give the model instructions — a system prompt, examples, constraints — and it adjusts its behavior without any changes to the underlying weights. This sounds simple, and it is. It's also more powerful than it was two years ago, because the models themselves have gotten better at following complex instructions.

Retrieval-augmented generation (RAG) adds a retrieval layer. At inference time, the model pulls relevant documents from an external store — your agency's policy documents, your knowledge base, your case files — and incorporates that content into its response. The model's weights don't change. The context does. This is how you give a model access to current, domain-specific information without baking it into the model itself.

Fine-tuning adjusts the model's weights on a new dataset. You take the foundation model, run a training process on your data, and produce a new model artifact with modified parameters. The model now "knows" something it didn't know before, in the sense that its weights have been updated to reflect patterns in your data.

Training from scratch is what the foundation model providers do. Nobody in your customer base is doing this.

Fine-tuning is overrated for most enterprise use cases. The territory it used to own has been absorbed by the other three approaches.

Two years ago, fine-tuning was the standard answer to "how do we make this model work for our domain?" Context windows were short — a few thousand tokens — which meant you couldn't inject much domain context at inference time. Prompting was less reliable. RAG tooling was immature. Fine-tuning was the mechanism that let you get domain-specific behavior out of a general-purpose model.

Context windows are now measured in hundreds of thousands of tokens. You can inject an entire policy manual into a single inference call. Prompting has gotten substantially more capable, particularly with structured system prompts and few-shot examples. RAG tooling has matured to the point where a competent engineering team can build a retrieval layer in weeks, not months.

Most of what people wanted from fine-tuning, they can now get from a well-designed RAG layer and a carefully written system prompt. Without the overhead.

That overhead is real. Fine-tuning creates a model artifact you now own and must maintain. When the foundation model provider releases an improved base model, your fine-tuned version doesn't automatically benefit — you have to retrain. You need to manage a training data pipeline, which means data quality, versioning, and governance. You've introduced a maintenance obligation that didn't exist before.

Fine-tuning is still justified in specific circumstances. If you need highly consistent style or tone at massive scale and prompting isn't reliable enough, fine-tuning can help. If your domain uses vocabulary or notation that genuinely doesn't appear in the foundation model's training data — specialized scientific notation, classified program names, proprietary taxonomies — fine-tuning can encode that. If you're running latency-sensitive applications where long context windows are too slow, a fine-tuned model with the knowledge baked in may be faster.

Those are real use cases. They're also genuinely uncommon in most enterprise deployments. Most "we need to fine-tune" conversations are actually "we need a retrieval layer and better prompts."

“

• Fine-tuning: A process that adjusts a foundation model's weights on a new dataset, producing a modified model artifact. It's the right tool for specific problems — consistent style at scale, genuinely novel domain vocabulary, latency constraints — but prompting and retrieval have absorbed most of the territory it used to own.

Okta Concept Mapping

The analogy: a foundation model is to your AI deployment what Okta is to your identity stack.

Your customers don't write their own authentication and authorization infrastructure. They configure a shared platform, one that handles the hard problems of credential management, federation, MFA, and session lifecycle, and they adapt it to their tenant's requirements through policies, attribute mappings, and integrations. The provider handles the infrastructure. The agency handles the configuration.

Foundation models work the same way. The provider trains the base model. The agency adapts it through prompts, retrieval layers, and occasionally fine-tuning. Nobody builds from scratch.

The analogy breaks at auditability.

With Okta, you can inspect the user store. You can audit attribute mappings. You can pull a log and trace exactly why a specific user got or didn't get access to a specific resource at a specific time. The system is opaque in some ways, but the data layer is auditable. When a CIO asks "why did this user get access to that system," there is a clean answer, traceable to specific configuration decisions.

With a foundation model, you cannot do this. The weights are opaque. You can observe behavior — you can test what the model says in response to specific inputs — but you cannot inspect what the model "knows" about your domain the way you can inspect an IdP's user store. When a CIO asks "what does this model know about our procurement policies," there is no clean answer. You can tell them what documents were in the training corpus, if the provider discloses that. You can tell them what documents your RAG layer retrieves. You cannot tell them how the model has encoded that information internally, or whether it has.

This opacity matters for procurement conversations. Agencies that ask "what does this model know about our domain?" are asking a reasonable question that the current technology cannot fully answer. That's not a reason to avoid foundation models. It is a reason to design adaptation strategies — RAG, primarily — that keep the auditable knowledge layer outside the model, where you can inspect it.

The Conversation You're Going to Have

A CIO asks: "We have ten years of policy documents and case files. Do we need to fine-tune a model on all of that?"

Probably not. A retrieval layer pulls the relevant documents at inference time. The model sees the actual text of the relevant policy, not a learned approximation of it. When the policy changes, you update the document store — you don't retrain a model. When the CIO asks why the model gave a specific answer, you can show them which documents were retrieved. The knowledge layer is auditable because it lives in a system you control, not in weights you can't inspect.

Fine-tuning that corpus would give you a model that has "learned" the policies in the sense that its weights have been updated to reflect patterns in the text. But you can't show the CIO which policy the model is drawing on. You can't update a policy without retraining. You've created a maintenance obligation and reduced auditability in exchange for a capability you could have gotten from a retrieval layer.

Fine-tuning proposals from vendors often come with significant professional services costs, training infrastructure requirements, and ongoing maintenance commitments. RAG implementations are not free, but they're more modular, more auditable, and more aligned with how government agencies actually manage information — as documents with versions and owners, not as patterns in a weight matrix.

When a buyer says "we need to fine-tune," the question to ask is: what specific behavior are you trying to get that prompting and retrieval can't give you? If they can answer that question precisely, fine-tuning might be justified. If the answer is "we want the model to really know our domain," that's a RAG problem, not a fine-tuning problem.

The foundation model is already configured for your tenant. The question is how you give it access to your context — and the answer, most of the time, is retrieval, not retraining.

“

• The practical frame: When a buyer reaches for fine-tuning, ask what specific behavior they're trying to achieve that prompting and retrieval can't deliver. Most of the time, the answer points back to a retrieval layer — which is auditable, maintainable, and doesn't require owning a model artifact.