Every enterprise AI conversation eventually hits the same wall. The model is capable. The model is fast. But the model doesn't know your agency's current contractor rates, the exception your general counsel added to the standard terms last quarter, or the policy revision that went into effect after the last budget cycle. It knows what it was trained on, and your data wasn't in the training set.
RAG solves that by giving the model access to your data at query time — fetching the relevant pieces, handing them to the model as context, and letting it reason from there. No retraining required.
The Loop
RAG has four steps. The first two run at ingestion, before any query arrives. The last two run every time a user asks a question.
Embed. Before any query happens, your documents are converted into embeddings: numerical vectors that represent semantic meaning. An embedding is a coordinate in a very high-dimensional space. Documents that mean similar things end up near each other in that space. A policy document about contractor payment terms and another about contractor invoice processing will be neighbors. A document about physical security will be somewhere else entirely.
Index. Those vectors get stored in a vector database or vector index. Pinecone, Weaviate, and PostgreSQL with the pgvector extension are representative of the current landscape. The index is a snapshot of your document corpus at the moment of ingestion.
Retrieve. When a user submits a query, the query gets embedded using the same model that embedded the documents. The system then finds the vectors most similar to the query vector — the top-k results, where k is typically somewhere between 3 and 20. These are the chunks the system judges most relevant.
Generate. Those retrieved chunks get passed to the language model as context. The model reads them and generates a response. What the model knows about your enterprise comes entirely from what the Retrieve step returned.
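To make the loop concrete, here is a minimal sketch of all four steps in plain Python. Everything in it is an assumption for illustration: the three-dimensional vectors are hand-written stand-ins for what an embedding model would produce (real embeddings have hundreds or thousands of dimensions), the chunk texts are invented, and the names `top_k` and `build_prompt` are hypothetical. A production vector database would also typically use approximate nearest-neighbor search rather than scoring every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Embed + Index (ingestion time). The vectors are hand-written stand-ins,
# not the output of a real embedding model.
index = [
    ("Contractor payment terms: invoices are paid after approval.", [0.9, 0.1, 0.0]),
    ("Contractor invoice processing: submit invoices to finance.", [0.8, 0.2, 0.1]),
    ("Physical security: badges are required at all entrances.", [0.0, 0.1, 0.9]),
]

def top_k(query_vector, index, k=2):
    """Retrieve: score every indexed chunk against the query, keep the k best."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def build_prompt(question, chunks):
    """Generate (input side): assemble retrieved chunks into the model's context."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )

question = "How do contractors get paid?"
query_vector = [0.85, 0.15, 0.05]  # stand-in for embedding the question with the same model

retrieved = top_k(query_vector, index, k=2)
prompt = build_prompt(question, [text for _, text in retrieved])
print(prompt)
```

The prompt is the only place your data reaches the model. The physical security chunk never appears in it, which is the point: whatever retrieval leaves out, the model never sees.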
Vector similarity search became the default retrieval mechanism because keyword search breaks down on meaning. A query about "how contractors get paid" won't match a document that uses "vendor disbursement" throughout. Embeddings capture semantic proximity rather than lexical overlap, which makes them far more useful for unstructured enterprise content.
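The keyword failure is easy to see in a few lines. This is a sketch with an invented document title, and `keywords` is a hypothetical helper standing in for a naive keyword matcher.

```python
import re

def keywords(text):
    """Lowercased word set, the raw material of a naive keyword match."""
    return set(re.findall(r"[a-z]+", text.lower()))

query = "how contractors get paid"
document_title = "Vendor disbursement schedule and approval workflow"  # hypothetical title

# Lexical overlap: the words the query and the document have in common.
shared = keywords(query) & keywords(document_title)
print(shared or "no shared terms: keyword search misses this document")
```

The two strings describe the same topic and share no words, so a match based purely on term overlap has nothing to work with.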
RAG became the standard pattern for grounding models in proprietary knowledge because fine-tuning bakes knowledge into the model weights at a point in time, costs significant compute, and requires retraining every time the knowledge changes. RAG externalizes the knowledge. You update the index, not the model. For public sector environments where policy documents change on legislative cycles and the audit trail matters, that separation pays off operationally.
Okta Concept Mapping
RAG most resembles a federated attribute query: a system that doesn't hold information locally queries an external source at runtime and reasons from what it gets back. The analogy holds for the shape of the interaction. It breaks on trust. In federated identity, the attribute provider is authoritative and you know when the token is expired. In RAG, the index is a similarity approximation with no TTL, no revocation, and no signal to the model that a retrieved chunk is stale or contextually wrong.
Where It Breaks
Stale indexes. The index is a snapshot. If the underlying documents change after ingestion and the index isn't refreshed, the model retrieves outdated information and presents it with full confidence. There's no equivalent of token expiry. The model has no way to know the document was superseded last month.
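One way to surface staleness is to record a content fingerprint at ingestion and compare it against the source of record on a schedule. The sketch below assumes you can read the current text of each document; the function names, document ids, and texts are all hypothetical, and it illustrates the idea rather than a feature of any particular vector database.

```python
import hashlib

def fingerprint(text):
    """Stable content hash recorded alongside each document at ingestion."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(indexed_fingerprints, current_documents):
    """Return ids whose current text no longer matches what was indexed.

    Documents that were never indexed are flagged too, because their
    fingerprint is missing from the index records.
    """
    stale = []
    for doc_id, text in current_documents.items():
        if indexed_fingerprints.get(doc_id) != fingerprint(text):
            stale.append(doc_id)
    return sorted(stale)

# Hypothetical state: what the index saw at ingestion vs. what the source says today.
indexed = {
    "travel-policy": fingerprint("Per diem is capped."),
    "payment-terms": fingerprint("Invoices are paid in 30 days."),
}
current = {
    "travel-policy": "Per diem is capped.",
    "payment-terms": "Invoices are paid in 45 days.",
}

stale = find_stale(indexed, current)
print(stale)
```

A check like this tells you which documents need re-embedding. It does not tell the model anything at query time, so the refresh still has to happen before the stale chunk gets retrieved.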
Chunking failures. Documents get split into chunks before indexing, typically by token count or paragraph boundary. When a chunk boundary falls between a policy statement and its exception, or between a table header and its data, the chunk loses meaning. The retrieval system finds it because it's similar to the query. The model reasons from a fragment. The output looks authoritative and is wrong in ways that are hard to catch after the fact.
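A short sketch shows how easily this happens. The policy text is invented, and `chunk_by_words` is a hypothetical fixed-size splitter that counts words rather than tokens to keep the example dependency-free.

```python
def chunk_by_words(text, max_words):
    """Naive fixed-size chunking: split on whitespace, group every max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Hypothetical policy text: a rule followed immediately by its exception.
policy = (
    "Contractors are paid within 30 days of invoice approval. "
    "Exception: invoices over the delegated threshold require counsel review first."
)

chunks = chunk_by_words(policy, max_words=9)
for number, chunk in enumerate(chunks, start=1):
    print(number, chunk)
```

The first chunk contains the rule and nothing about the exception. If retrieval returns only that chunk, the model answers from half the policy. Overlapping windows and structure-aware splitting are common mitigations, but neither guarantees that a rule and its exception stay together.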
Retrieval mismatches. Top-k retrieval returns the most similar vectors, not the most relevant documents. Similarity in embedding space is a proxy for semantic relevance, and it's an imperfect one. A query about contractor payment disputes might retrieve chunks about contractor security clearance disputes — both are about contractor disputes, so the embeddings will be neighbors. The model gets plausible-looking wrong context and reasons from it confidently.
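The mismatch can be reproduced with toy vectors. In this sketch the four dimensions are hand-labeled (contractor, dispute, payment, clearance) purely for illustration. Real embedding dimensions are not interpretable this way, and the numbers are assumptions rather than output from any model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-written toy vectors: [contractor, dispute, payment, clearance]
chunks = {
    "payment-dispute": [0.7, 0.6, 0.4, 0.0],
    "clearance-dispute": [0.7, 0.6, 0.0, 0.4],
    "payment-schedule": [0.5, 0.0, 0.8, 0.0],
}
query = [0.7, 0.7, 0.2, 0.0]  # "contractor payment disputes"

# Rank every chunk by similarity to the query, most similar first.
ranked = sorted(chunks, key=lambda name: cosine_similarity(query, chunks[name]), reverse=True)
print(ranked)
```

With k set to 2, the clearance dispute chunk makes the cut ahead of the payment schedule chunk, because "contractor" and "dispute" dominate the similarity score. Nothing in the score says it is about the wrong kind of dispute.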
Naive, Iterative, and Agentic
Single-pass RAG — one retrieval, one generation — is the baseline. Fast and simple, and it fails in the ways described above with no mechanism for recovery.
Iterative RAG adds a feedback loop: the system retrieves, generates an intermediate result, uses that result to refine the query, and retrieves again. This handles cases where the first retrieval was directionally right but missed nuance.
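The control flow looks roughly like the sketch below. Both `retrieve` and `draft` are hypothetical stand-ins: the retriever ranks by shared words instead of embeddings, and `draft` fakes the model call with a hard-coded rule so the example runs without any external service. The corpus text is invented.

```python
import re

CORPUS = [
    "Payment terms: contractors are paid after invoice approval, subject to the exceptions memo.",
    "Exceptions memo: counsel may hold flagged invoices for review.",
    "Physical security: badges required at all entrances.",
]

def words(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, k=2):
    """Stand-in retriever: rank chunks by shared words, drop chunks with none."""
    scored = [(len(words(query) & words(chunk)), chunk) for chunk in CORPUS]
    matches = [item for item in scored if item[0] > 0]
    matches.sort(key=lambda item: item[0], reverse=True)
    return [chunk for _, chunk in matches[:k]]

def draft(question, context):
    """Stand-in for a model call: returns (answer, follow_up_query or None)."""
    has_memo = any(chunk.startswith("Exceptions memo") for chunk in context)
    mentions_memo = any("exceptions memo" in chunk for chunk in context)
    if mentions_memo and not has_memo:
        return "Partial answer: the exceptions memo is referenced but missing.", "exceptions memo"
    return "Answer drafted from: " + " | ".join(context), None

def iterative_rag(question, max_rounds=3):
    query, context, answer = question, [], None
    for _ in range(max_rounds):
        for chunk in retrieve(query):
            if chunk not in context:  # keep context free of duplicates
                context.append(chunk)
        answer, follow_up = draft(question, context)
        if follow_up is None:  # the draft needs nothing further
            break
        query = follow_up  # refine the query and retrieve again
    return answer, context

answer, context = iterative_rag("how are contractors paid")
print(answer)
```

The first pass finds the payment terms, which point to a memo that was not retrieved. The second pass goes and gets it. The `max_rounds` cap is there because a refinement loop with no limit can keep chasing references indefinitely.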
Agentic retrieval goes further. The agent decides when to retrieve, what to retrieve, and whether the results are sufficient before proceeding. That's Lesson 5 territory, so I'll leave it there.
The Conversation That Matters
When a CAIO says "we're building a RAG system on top of our policy documents," the question that signals fluency isn't about the model. It's about the index. How often do those documents change? What triggers a refresh? Who owns that process? What happens when a policy is updated but the index isn't?
The model is the part everyone talks about. The index is the part that determines whether the answers are trustworthy. Your buyer knows the difference between a system that's accurate and one that sounds accurate. The question about index freshness is how you show them you do too.

