AI architecture has a vocabulary problem. Nobody in the sales cycle has much incentive to define terms precisely. Vendors want you to feel the capability, not interrogate the implementation. Buyers want to sound current. The result is a lot of conversations where "we're evaluating RAG" or "we're running inference on-prem" gets treated as a complete thought, and everyone in the room agrees to move on.
This section is the briefing that doesn't happen in vendor decks.
The Gap Is Specific
You already know what it feels like. A buyer's CTO mentions they're "hosting their own model on dedicated infrastructure" and you parse the words correctly but can't quite picture the thing. Someone says "we're using DeepSeek through Bedrock" and you understand that this involves a model and a cloud platform, but the architecture — what's actually running where, who controls what, what the failure modes are — stays fuzzy. You can hold the conversation. You just can't stress-test it.
That's not a knowledge failure. It's a vocabulary gap that the industry has been happy to leave open because vague vocabulary closes deals faster than precise vocabulary does. The problem is that precise vocabulary is what keeps deals alive past the proof-of-concept stage, when the buyer's infrastructure team starts asking questions that the original pitch wasn't designed to answer.
The gap is also smaller than it looks. The concepts that underlie most enterprise AI architecture conversations are not numerous. What a model physically is — what "weights" means as a concrete artifact rather than a metaphor. What inference actually costs, and why the cost structure is different from what you'd expect coming from SaaS. What "open source" means in this context, which turns out to be a more complicated question than it sounds. What RAG is as an architecture, not a buzzword. These are the load-bearing concepts. Everything else in the buyer conversation is built on top of them.
Note: In identity, a service provider connects to an identity provider through a live, running service — you configure a trust relationship with an endpoint that exists and responds. The closest AI equivalent is a hosted model API: a running service you call, with an endpoint, SLAs, and a vendor on the other side. It diverges here: AI models can also exist as static artifacts — files containing the model's learned parameters — that you download, run yourself, and operate entirely without the original vendor. The same "model" can be either thing depending on how you deploy it, which means questions about availability, support, and security have completely different answers depending on which deployment pattern you're looking at. Your SaaS mental model handles the first case cleanly. It doesn't map to the second at all.
What This Section Covers
The pieces that follow work through the mechanical reality that vendor decks skip: the actual physical and economic structure of how these systems run.
That starts with what a model is. Concretely: what the artifact looks like, what lives in the weights, why a 70-billion-parameter model requires the hardware it requires, and what it means when someone says they're "fine-tuning" versus "prompting" versus "deploying." These distinctions matter because buyers are making infrastructure commitments based on them, and the commitments are not interchangeable.
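A back-of-envelope calculation makes the hardware point concrete. The sketch below estimates the size of the weights alone, assuming every parameter is stored at one numeric precision. The function name and the precision table are illustrative, not taken from any particular tool, and the result leaves out everything else a running model needs in memory (cache, activations, runtime overhead), so treat it as a floor rather than a sizing guide.

```python
# Rough size of a model's weights at different numeric precisions.
# Assumption: every parameter is stored at the same precision.
# Excludes KV cache, activations, and serving overhead.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    size = weight_memory_gb(70e9, precision)
    print(f"70B parameters at {precision}: ~{size:.0f} GB")
```

At 16-bit precision, 70 billion parameters come to roughly 140 GB of weights, which is more than a single typical accelerator holds. That is why "we're hosting a 70B model" is a statement about multiple GPUs, or about quantizing the model down to a lower precision, and why the follow-up question about which of the two is worth asking.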
From there: inference economics. Running a model costs money in ways that don't behave like SaaS licensing costs, and the cost structure shapes what's feasible at scale. A buyer who says "we'll just run this ourselves" is making a claim about hardware, staffing, and operational overhead that may or may not be realistic, and you need enough grounding to know which questions to ask.
Then the open-source question, which is genuinely complicated. "Open source" in AI is applied to models with meaningfully different license terms, different levels of weight availability, and different implications for what the buyer can actually do with the thing. The word is doing a lot of work that it isn't always qualified to do.
And RAG — retrieval-augmented generation — which is the architecture pattern showing up in more enterprise AI conversations than anything else right now. It's not a product. It's a design pattern with real infrastructure requirements, real failure modes, and real questions about where the data lives and who controls it.
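The shape of the pattern fits in a few lines. The sketch below is a toy, not a reference implementation: the embed function is a word-count stand-in for a real embedding model, the document list stands in for a vector store, and the final call to a language model is left out entirely. All names and sample documents are hypothetical. What it does show is the two steps that define RAG: retrieve the most relevant text, then put that text into the prompt.

```python
import re
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    # Stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    # Retrieval step: rank stored documents by similarity to the query.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 1) -> str:
    # Augmentation step: place the retrieved text in front of the question.
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical internal documents standing in for a buyer's knowledge base.
DOCS = [
    "Password resets are handled through the self-service portal.",
    "Expense reports are due on the fifth business day of each month.",
]

prompt = build_prompt("When are expense reports due?", DOCS)
print(prompt)  # This string is what would be sent to the model.
```

Each function corresponds to a component the buyer has to place somewhere: the embedding model, the document store, and the model that receives the final prompt. Asking where each one runs and who operates it is the RAG version of the "where does the data live" question.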
Note: In identity, you evaluate a vendor's trustworthiness partly through formal attestation: certifications, audit reports, federation metadata signed with verifiable keys. The closest AI equivalent is model provenance: where did this model come from, who trained it on what data, and what's actually in the weights. It diverges here: the attestation infrastructure for AI provenance is still being built. A model card is a document the model's creator wrote about its own model, so it is self-reported rather than independently attested. There's no SAML metadata equivalent, meaning no independently verifiable, cryptographically signed record of what a model was trained on or what it will do. When a buyer says "we need to understand the model's lineage," they're asking a reasonable question that the industry doesn't yet have a clean answer for.
What Fluency Actually Looks Like
The goal is to get you to the point where the architecture conversation is legible — where you can hear "we're doing RAG with a self-hosted embedding model and a managed inference endpoint" and understand what the buyer has committed to, what they've left open, and what questions would be useful to ask next.
That's a different kind of fluency than knowing the math. It's the fluency of someone who understands the system well enough to know where it can go wrong, which is exactly the fluency that builds credibility in a room full of people who are hoping nobody asks a follow-up question.
The concepts here are not difficult. They've just been systematically under-explained, because the people selling AI systems have found that under-explanation is commercially convenient. This section fixes that.

