When a user authenticates, your stack doesn't embed their entire employment history in the session token. It queries an authoritative source at runtime — the directory, the HR system, the entitlement store — pulling exactly the attributes the policy needs, at the moment it needs them. The model stays small; the context arrives on demand.
Retrieval-Augmented Generation (RAG) is that same structural move applied to language model inference. Instead of encoding proprietary knowledge into model weights at training time, you retrieve it at query time from an indexed document store. The model stays general; your organization's knowledge arrives when the question asks for it.
The Loop
RAG has four stages, and understanding what each one does is the only way to understand why it fails.
Embed. Every document in your corpus (policy manuals, contract templates, case notes, whatever constitutes the proprietary knowledge you want the model to use) gets split into chunks, and each chunk gets converted into a numerical vector by an embedding model. The vector represents semantic meaning, not keywords. Two sentences that say the same thing in different words should produce similar vectors. Two sentences that use the same words to mean different things should not.
Index. Those vectors get stored in a vector database alongside the original text chunks they represent. Common chunk sizes run around 256 to 512 tokens, roughly a paragraph. The index is a snapshot of your corpus at the moment of ingestion.
Retrieve. When a user submits a query, that query gets embedded using the same model, producing a query vector. The vector database runs a similarity search and returns the top k chunks (k is typically three to five) whose vectors sit closest to the query vector in the embedding space. These chunks become the context the model will reason over.
Generate. The retrieved chunks get injected into the model's prompt alongside the original query. The model generates a response grounded in that retrieved context rather than relying solely on what it learned during training.
The retrieval step adds latency, typically 200 to 800 milliseconds depending on index size and infrastructure. In exchange, the model can answer questions about documents it never saw during training, and you can update the knowledge base without touching the model.
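The four stages can be sketched in a few lines of Python. This is a toy under loud assumptions: the embed function below counts words, which makes it keyword-based and exactly what a real embedding model is not. It stands in only so the example runs without external services. All function names (embed, build_index, retrieve, build_prompt) and the sample corpus are hypothetical, and the generate stage stops at building the prompt because no model is called.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts.
    A real embedding model captures meaning; this one only sees keywords."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks):
    """Index: store each vector alongside the chunk text it represents."""
    return [(embed(chunk), chunk) for chunk in chunks]

def retrieve(index, query, k=3):
    """Retrieve: embed the query the same way, return the k closest chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query, context_chunks):
    """Generate (first half): inject retrieved chunks next to the question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Remote work requests require written approval from a supervisor.",
    "Travel reimbursement claims are due within thirty days of the trip.",
    "Badge access is revoked on the last day of employment.",
]
index = build_index(corpus)
question = "Who has to approve a remote work request?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Note that nothing in the model changes when the corpus list changes. Rebuilding the index is the entire update path, which is the property the loop is designed to deliver.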
Okta Concept Mapping
RAG's retrieve step maps cleanly to federated attribute lookup: at runtime, go out to an external authoritative source for context the local system doesn't hold. The analogy holds for the architecture. It breaks on the trust relationship. In federated identity, the query is deterministic; the same query against the same directory returns the same attributes, and "authoritative" means something precise. In RAG, "authoritative" is replaced by "semantically similar," which is a probabilistic judgment made by a vector similarity function. The retrieved content isn't the right answer; it's the closest answer the index found. That distinction matters every time someone asks why the model said something confidently wrong.
Where the Loop Breaks
Three failure modes account for most RAG problems in production.
Stale indexes. The index is a snapshot. If a policy document is updated after ingestion, the model retrieves the old version and generates a response based on superseded content. The model has no way to know the document changed. In environments where policy changes frequently, and in federal agencies policy changes constantly, index freshness isn't a configuration detail. It's a governance question. How often does the index rebuild? Who owns that process? What's the SLA between a document update and index propagation?
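One way to make staleness measurable, sketched here as an assumption rather than a description of any particular product, is to record a content hash per document at ingestion and compare it against the live documents on a schedule. The names fingerprint, record_ingestion, and needs_reindex are hypothetical, as are the sample policies.

```python
import hashlib

def fingerprint(text):
    """Content hash used to detect that a document changed after ingestion."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def record_ingestion(documents):
    """Build a manifest of doc_id -> hash at the moment the index is built."""
    return {doc_id: fingerprint(text) for doc_id, text in documents.items()}

def needs_reindex(manifest, current_documents):
    """Return (changed_or_new, removed) document ids relative to the manifest."""
    changed = sorted(
        doc_id for doc_id, text in current_documents.items()
        if manifest.get(doc_id) != fingerprint(text)
    )
    removed = sorted(doc_id for doc_id in manifest if doc_id not in current_documents)
    return changed, removed

at_ingestion = {
    "telework-policy": "Telework is limited to two days per week.",
    "travel-policy": "Travel claims are due within thirty days.",
}
manifest = record_ingestion(at_ingestion)

# Later, the telework policy is revised but the index has not been rebuilt.
today = {
    "telework-policy": "Telework is limited to three days per week.",
    "travel-policy": "Travel claims are due within thirty days.",
}
changed, removed = needs_reindex(manifest, today)
print(changed, removed)
```

A check like this only detects drift. Who runs it, how often, and how quickly a flagged document gets re-embedded are still the governance questions above.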
Chunking failures. Documents get split into chunks before indexing, and the split boundaries matter more than most implementations acknowledge. A chunk that contains the answer to a question but not the context that makes the answer interpretable will get retrieved and will produce a confident, incomplete response. The classic case: a numbered list where the question asks about item 7, the chunk contains item 7, but the chunk doesn't contain the list header that explains what the items are. The model sees a fragment and generates around it. Chunk size and overlap parameters are not defaults to accept without review.
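The numbered-list case is easy to reproduce. The sketch below uses a fixed-size sliding window over words as a stand-in for token-based chunking; the chunk_words function, the size and overlap values, and the sample policy text are all assumptions chosen to make the boundary visible.

```python
def chunk_words(text, size, overlap=0):
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk. Words stand in for tokens in this sketch."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

policy = (
    "Prohibited uses of agency devices: 1. Personal email. 2. Streaming video. "
    "3. Unapproved software. 4. Crypto mining. 5. Gambling. 6. File sharing. "
    "7. Storing case files locally."
)

chunks = chunk_words(policy, size=12, overlap=4)
item_seven = [c for c in chunks if "7." in c]
for c in chunks:
    print(repr(c))
```

With these settings the only chunk containing item 7 begins mid-list and never mentions that the items are prohibited uses. A query about storing case files locally would retrieve it, and the model would have no way to tell whether the practice is required, allowed, or banned.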
Retrieval mismatches. Embedding similarity finds chunks that are semantically close to the query. Close is not the same as relevant. A query about contract termination clauses might retrieve chunks about contract renewal clauses because the embedding space treats them as neighbors. The model receives plausible-looking context that doesn't actually answer the question, and generates a plausible-looking response that doesn't either. This failure mode is the hardest to catch in evaluation because the output reads fluently.
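Because fluent output hides this failure, one common approach is to score retrieval separately from generation, using a small set of queries whose relevant chunks have been labeled by hand. The sketch below computes a hit rate at k: the fraction of queries for which at least one labeled-relevant chunk appears in the top k results. The function name, chunk ids, and sample data are hypothetical.

```python
def hit_rate_at_k(results, relevant, k):
    """Fraction of labeled queries with at least one relevant chunk in the top k.

    results:  query -> ranked list of retrieved chunk ids (best first)
    relevant: query -> set of chunk ids a reviewer judged relevant
    """
    if not relevant:
        raise ValueError("need at least one labeled query")
    hits = 0
    for query, relevant_ids in relevant.items():
        top_k = results.get(query, [])[:k]
        if relevant_ids & set(top_k):
            hits += 1
    return hits / len(relevant)

# Hand-labeled ground truth for two queries.
relevant = {
    "termination notice period": {"termination-3"},
    "renewal deadline": {"renewal-1"},
}

# What the retriever actually returned, best match first.
results = {
    "termination notice period": ["renewal-2", "termination-3", "renewal-1"],
    "renewal deadline": ["renewal-1", "renewal-2"],
}

print(hit_rate_at_k(results, relevant, k=1))
print(hit_rate_at_k(results, relevant, k=3))
```

In this made-up data the termination query ranks a renewal chunk first, so the hit rate at k=1 is half of what it is at k=3. That gap is the retrieval mismatch showing up as a number instead of as a fluent wrong answer.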
Naive RAG vs. Iterative Retrieval
The loop described above is naive RAG: one retrieval pass, one generation. The query goes in, the top-k chunks come back, the model generates. Simple, fast, and limited by what a single similarity search can surface.
Iterative retrieval, sometimes called multi-hop retrieval, allows the model to issue follow-up queries based on what the first retrieval returned. If the initial chunks establish that the answer depends on a second document, the model can retrieve that document before generating. The result is better grounding for questions that require synthesizing across multiple sources. The cost is latency and complexity: more retrieval passes, more opportunities for the failure modes above to compound.
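The difference between the two patterns can be sketched with a toy retriever. Everything here is an assumption made for illustration: the documents, the word-overlap retriever, and especially follow_up, which uses a regular expression as a stand-in for the model deciding that it needs a second source. In a real system that decision would come from the model itself.

```python
import re

DOCS = {
    "leave-policy": "Extended leave requests need sign-off. "
                    "See the delegation memo for who may approve.",
    "delegation-memo": "Delegation memo: division directors may approve "
                       "extended leave of up to ninety days.",
    "travel-policy": "Travel claims are due within thirty days of the trip.",
}

def words(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, exclude=()):
    """Toy retriever: the not-yet-seen document sharing the most words with the query."""
    candidates = [doc_id for doc_id in DOCS if doc_id not in exclude]
    if not candidates:
        return None
    q = words(query)
    return max(candidates, key=lambda doc_id: len(q & words(DOCS[doc_id])))

def follow_up(text):
    """Stand-in for the model: turn a 'See the X for ...' pointer into a new query."""
    match = re.search(r"See the ([a-z ]+?) for", text)
    return match.group(1) if match else None

def iterative_retrieve(query, max_hops=3):
    """Retrieve, inspect the result, and keep retrieving until no follow-up remains."""
    collected = []
    while query is not None and len(collected) < max_hops:
        doc_id = retrieve(query, exclude=collected)
        if doc_id is None:
            break
        collected.append(doc_id)
        query = follow_up(DOCS[doc_id])
    return collected

question = "What is the rule for extended leave requests?"
single_pass = iterative_retrieve(question, max_hops=1)  # naive RAG
multi_hop = iterative_retrieve(question)                # iterative retrieval
print(single_pass, multi_hop)
```

With max_hops=1 the loop behaves like naive RAG and returns only the leave policy, which points elsewhere for the actual answer. Allowing a second hop pulls in the delegation memo, at the cost of a second retrieval pass and a second chance for a mismatch.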
When a vendor proposes a RAG implementation, the question worth asking is which pattern they're using and why. Naive RAG is appropriate for straightforward question-answering against a well-structured corpus. Iterative retrieval fits when queries require multi-document reasoning, and it introduces its own failure surface that a single-pass evaluation won't catch.
What Your Buyer Actually Means
When a CAIO says "we're grounding our AI in our policy documents using RAG," they're describing an architecture that is sound in principle and fragile in practice. The principle is correct: don't encode volatile knowledge in model weights, retrieve it at runtime. The fragility lives in the index maintenance process, the chunking strategy, and the retrieval quality, none of which are visible in a demo.
The questions that surface real implementation maturity: How often does the index rebuild, and who owns that trigger? What's the chunking strategy, and was it validated against the actual query distribution? How is retrieval quality measured in production, not just at launch?
These questions separate a RAG deployment that works on day one from one that still works on day ninety.

