What It Is, Precisely
RAG is an inference-time architecture that retrieves relevant document chunks from an external index and injects them into the model's context window before generation. The model doesn't search anything. It reads what you hand it and generates a response based on that content, combined with its training. The retrieval happens outside the model; the generation happens inside it. Those are two separate systems with two separate failure modes, and conflating them is where most RAG conversations go wrong.
The Loop
Embed. Before any query runs, your documents get converted into vectors: high-dimensional numerical representations that encode semantic meaning. A chunk of text about FISMA compliance and a chunk about FedRAMP authorization will land near each other in this vector space because they're conceptually adjacent, even if they share no keywords. An embedding model does this conversion. Neither the documents nor their vectors are stored in the generating model; the vectors, along with the chunk text they point back to, live in an external index.
Index. Those vectors get stored in a vector database — think of it as a spatial index optimized for similarity search rather than exact match. A moderately sized policy library might produce hundreds of thousands of vectors. The index makes it possible to search them in milliseconds.
Retrieve. When a user submits a query, the same embedding model converts that query into a vector. The system then finds the document vectors closest to the query vector — typically the top five to twenty chunks, depending on how the system is configured. "Closest" means semantically similar, not keyword-matched. This is the step where most production failures originate.
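The retrieve step can be sketched as a brute-force nearest-neighbor search. This is a minimal illustration, not how a production vector database works internally: real systems usually use approximate search, and the three-dimensional vectors and chunk names below are hypothetical stand-ins for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=5):
    """Return the k (chunk_id, score) pairs most similar to the query vector."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy vectors; real embeddings have hundreds or thousands of dimensions.
toy_index = {
    "fisma-overview": [0.9, 0.1, 0.0],
    "fedramp-authorization": [0.8, 0.2, 0.1],
    "cafeteria-menu": [0.0, 0.1, 0.9],
}
query_vec = [0.85, 0.15, 0.05]
results = top_k(query_vec, toy_index, k=2)
```

With these toy values the two compliance chunks outrank the unrelated one. Note that the score measures closeness only; nothing in it says whether a chunk is correct or current.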
Generate. The retrieved chunks get inserted into the model's context window alongside the original query. The model reads both and generates a response. It doesn't know whether the retrieved chunks are correct, current, or complete. It knows only that they're there.
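What "inserted into the context window" means in practice is string assembly. The sketch below is one plausible way to do it; the prompt wording, the `build_prompt` name, and the chunk fields (`source`, `text`) are assumptions for illustration, not a standard format.

```python
def build_prompt(query, chunks):
    """Concatenate retrieved chunks and the user's question into one prompt string."""
    context_parts = []
    for i, chunk in enumerate(chunks, start=1):
        # Number each chunk and keep its source so the answer can cite it.
        context_parts.append(f"[{i}] (source: {chunk['source']})\n{chunk['text']}")
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
    )

# Hypothetical retrieved chunks.
retrieved = [
    {"source": "policy-a.pdf", "text": "Hypothetical chunk text one."},
    {"source": "policy-b.pdf", "text": "Hypothetical chunk text two."},
]
prompt = build_prompt("What does policy A require?", retrieved)
```

Everything the model will "know" about your documents for this request is in that string. If the wrong chunks were retrieved, the prompt is still well-formed and the model still answers.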
The whole loop adds latency, usually a few hundred milliseconds for the full retrieval step (embedding the query, searching the index, and fetching the matching chunks), but it means you can update your knowledge base without retraining the model. That's the trade that made RAG the default pattern for enterprise knowledge applications.
Okta Concept Mapping
RAG resembles claims-based federated access more than anything else in the IDAM stack. In federation, you don't embed all of a user's permissions into a token at issuance — you retrieve relevant claims at the point of need from an authoritative source. RAG does the same thing for knowledge: retrieve what's relevant at inference time rather than baking it into the model at training time. Both patterns solve the same underlying problem — you can't know everything in advance.
The analogy breaks on authority. Federated claims are authoritative. The IdP asserts them; the SP trusts the assertion because the trust chain is cryptographically established. RAG retrieval is probabilistic. The system retrieves what's similar to the query, not what's correct. There is no signed assertion. The model cannot verify that a retrieved chunk actually answers the question — it can only incorporate the chunk into its response. A hallucination that cites a retrieved-but-wrong document is harder to catch than a hallucination from nothing, because it looks sourced.
Where It Breaks
Stale indexes. The index reflects the documents as they existed when they were last embedded. If a policy was superseded last month and the index hasn't been refreshed, the model answers from the old version — confidently, with a citation. In federal environments where policy documents change on congressional or regulatory cycles, index freshness is an operational question, not a technical one. Someone has to own it.
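A freshness check can be as simple as comparing timestamps. The sketch below assumes you can get two things: when each source document last changed and when it was last embedded. The function name, document ids, and dates are hypothetical.

```python
from datetime import datetime, timezone

def audit_index(source_modified, indexed_at):
    """Compare source timestamps with embedding timestamps.

    Returns (stale, orphaned):
      stale    - ids whose source changed after the last embedding, or never embedded
      orphaned - ids still in the index whose source document no longer exists
    """
    stale = sorted(
        doc_id
        for doc_id, modified in source_modified.items()
        if doc_id not in indexed_at or modified > indexed_at[doc_id]
    )
    orphaned = sorted(set(indexed_at) - set(source_modified))
    return stale, orphaned

utc = timezone.utc
# Hypothetical document library and index metadata.
source_modified = {
    "access-policy": datetime(2024, 3, 1, tzinfo=utc),
    "spillage-procedures": datetime(2024, 5, 20, tzinfo=utc),
    "new-directive": datetime(2024, 6, 2, tzinfo=utc),
}
indexed_at = {
    "access-policy": datetime(2024, 4, 1, tzinfo=utc),
    "spillage-procedures": datetime(2024, 4, 1, tzinfo=utc),
    "old-policy-v1": datetime(2023, 11, 15, tzinfo=utc),
}
stale, orphaned = audit_index(source_modified, indexed_at)
```

Here the spillage procedures and the new directive come back as stale, and the withdrawn policy comes back as orphaned: still retrievable, still citable, no longer valid. The code is the easy part. Deciding who runs it and how often is the ownership question.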
Chunking failures. Documents get split into chunks before embedding — typically 256 to 512 tokens, with some overlap. If the answer to a question spans a chunk boundary, neither chunk alone is sufficient, and the model may generate a plausible bridge between two incomplete pieces of information. The chunking strategy is an engineering decision made at index time, with significant downstream effects on retrieval quality. Most buyers don't know this decision was made.
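Fixed-size chunking with overlap looks roughly like this. It is a minimal sketch: a list of words stands in for real tokenizer output, and the sizes are shrunk so the boundaries are visible. Many pipelines split on sentences or headings instead.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks

# Ten placeholder "tokens", chunked four at a time with one token of overlap.
words = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
# chunks == [['w0','w1','w2','w3'], ['w3','w4','w5','w6'], ['w6','w7','w8','w9']]
```

The overlap keeps a sentence that straddles a boundary partly visible in both neighbors, but an answer that needs `w2` and `w5` together still sits in no single chunk.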
Retrieval mismatches. The query vector and the correct document vector may not land near each other if the question is phrased differently from the document. A user asking "what's the process for handling a data spill" may not retrieve a document titled "Classified Information Spillage Procedures" if the embedding space doesn't map those phrasings as similar. This is the vocabulary mismatch problem that plagued keyword search for decades, wearing different clothes.
The Variants
Naive RAG runs one retrieval pass and hands the results to the model. That's the baseline, and it's what most "AI on our documents" implementations actually are.
Agentic RAG can issue multiple retrieval calls, deciding what to search for next based on what it found — useful when answering a question requires assembling information from several sources. Iterative RAG refines the query between passes, using intermediate results to improve subsequent retrieval. Both variants require tool-calling mechanics covered in Lesson 5. They exist because naive RAG's single-pass retrieval has quality ceilings it can't clear on its own; running the loop more than once is how those ceilings get raised.
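The multi-pass idea reduces to a loop around the retrieve step. In the sketch below, `retrieve` and `refine` are hypothetical callables supplied by the caller. In a real system `refine` would typically be a model call that decides what to search for next; here a toy rule stands in for it so the control flow is visible.

```python
def iterative_retrieve(query, retrieve, refine, max_passes=3):
    """Run retrieval repeatedly, letting refine() propose the next query.

    retrieve(query) -> list of chunk strings
    refine(query, chunks_so_far) -> next query string, or None to stop
    """
    collected = []
    seen = set()
    current = query
    passes = 0
    while current is not None and passes < max_passes:
        for chunk in retrieve(current):
            if chunk not in seen:  # skip chunks already collected in earlier passes
                seen.add(chunk)
                collected.append(chunk)
        passes += 1
        current = refine(current, collected)
    return collected, passes

# Toy stand-ins: a dict lookup instead of a vector search, a rule instead of a model.
TOY_CORPUS = {
    "incident reporting": ["Report incidents to the security office (see Directive 12)."],
    "Directive 12": ["Directive 12 sets a reporting deadline."],
}

def toy_retrieve(query):
    return TOY_CORPUS.get(query, [])

def toy_refine(query, chunks):
    # Follow a cross-reference once, then stop.
    if query != "Directive 12" and any("Directive 12" in c for c in chunks):
        return "Directive 12"
    return None

chunks, passes = iterative_retrieve("incident reporting", toy_retrieve, toy_refine)
```

The first pass finds a chunk that points elsewhere; the second pass follows the pointer. The `max_passes` cap matters: every extra pass adds retrieval latency and another chance to pull in something irrelevant.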
What to Listen For
When a buyer describes an AI system that "searches our documents," they're describing RAG. The questions that matter: How often is the index refreshed? Who owns that process? How were the documents chunked, and has anyone tested retrieval quality against the actual queries users are submitting? A system that can't answer those questions isn't broken yet. It's just waiting.
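"Tested retrieval quality" has a concrete shape: a list of real user queries, each paired with the document that should come back, scored against the retriever. The sketch below computes recall at k under those assumptions. The `toy_retrieve` function, its canned results, and the document ids are hypothetical placeholders for a real retriever.

```python
def recall_at_k(test_cases, retrieve, k=5):
    """Fraction of test queries whose expected document appears in the top k results."""
    if not test_cases:
        return 0.0
    hits = 0
    for query, expected_doc in test_cases:
        if expected_doc in retrieve(query)[:k]:
            hits += 1
    return hits / len(test_cases)

# Hypothetical canned results standing in for a real retriever.
CANNED_RESULTS = {
    "what's the process for handling a data spill": ["incident-response", "data-retention"],
    "how often are passwords rotated": ["password-policy", "access-policy"],
}

def toy_retrieve(query):
    return CANNED_RESULTS.get(query, [])

test_cases = [
    ("what's the process for handling a data spill", "spillage-procedures"),
    ("how often are passwords rotated", "password-policy"),
]
score = recall_at_k(test_cases, toy_retrieve, k=5)
```

In this toy run the data-spill query misses its expected document and the score is 0.5. A team that has run something like this against real queries can tell you their number. A team that hasn't is guessing.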

