What Retrieval-Augmented Generation Actually Does

How the chunk-embed-retrieve-generate loop works, where it breaks in production, and which questions matter in the room.

By Leigh Garrity, May 8, 2026

What RAG is

Most conversations about AI focus on what the model says. The engineering problem that determines whether the answer is grounded or fabricated is what the model reads before it says anything. That's where RAG lives.

Retrieval-Augmented Generation is a pattern that fetches relevant documents at query time and feeds them to a language model so its response is grounded in specific source material. The model doesn't learn anything new. It reads before it answers, the way you'd pull up a KB article before responding to a customer's question. RAG exists because language models are fluent, confident, and perfectly willing to fabricate. Giving them source material to cite is the primary production mechanism for keeping answers tethered to reality.

RAG defined: A pattern that retrieves relevant documents at query time and feeds them to a language model, grounding the model's response in specific source material rather than its training data alone.

How the loop works

Four steps, running in sequence every time a user asks a question.

Chunk. Source documents get split into pieces. A typical chunk runs 128 to 512 tokens, roughly one to three paragraphs of normal prose. The chunk is the unit of retrieval. Everything downstream operates on chunks, not whole documents. Keep that in mind. It matters.

Embed. Each chunk gets converted into a vector: a list of numbers that represents the chunk's meaning in a form that math can operate on. A modern embedding model produces roughly 1,500 to 3,000 numbers per chunk. The numbers encode relationships between concepts, which is why "federal compliance requirements" and "government regulatory obligations" end up with similar vectors despite sharing zero words. These vectors go into a vector database. That's your index.

Retrieve. The user's question gets embedded into the same vector space. The system finds the chunks whose vectors are closest to the query vector. In a basic setup, you grab the top 3 to 10 closest chunks. In production, you might pull the top 50, then rerank against the original question using a cross-encoder model and keep the best 5. Reranking is one of the most effective ways to improve retrieval quality. The tradeoff: a simple vector search takes milliseconds; a cross-encoder reranker can add seconds and cents per query.

Generate. The retrieved chunks get inserted into the prompt alongside the user's question. The language model reads them and writes a response. From the model's perspective, this is just reading and answering. It doesn't know a retrieval pipeline assembled the reading material. End-to-end latency for the full loop runs 1 to 3 seconds in production.

That's the loop. Chunk, embed, retrieve, generate. Every RAG system runs some version of it, from a startup chatbot to the Pentagon's GenAI.mil platform, which explicitly uses RAG to ground Gemini's responses for 3 million DoD personnel.

The RAG loop: Chunk documents into pieces, embed each piece as a numerical vector, retrieve the vectors closest to the user's query, feed those chunks to the language model, generate a grounded response.
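The loop above can be sketched in a few dozen lines. This is a toy under stated assumptions, not a production design: word counts stand in for a real embedding model (so it matches shared words, not meaning), a Python list stands in for the vector database, and the model call itself is left out, so the sketch stops at prompt assembly. All function names are illustrative.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40) -> list[str]:
    """Split text into fixed-size chunks; words stand in for tokens here."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system calls an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Chunk every document and store each chunk next to its vector."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 3) -> list[str]:
    """Embed the query and return the k chunks closest to it."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Insert retrieved chunks into the prompt alongside the question."""
    sources = "\n".join(f"[{n}] {c}" for n, c in enumerate(chunks, start=1))
    return f"Answer using only these sources.\n\n{sources}\n\nQuestion: {query}"


docs = [
    "Contractors must complete security training within 30 days of onboarding.",
    "Travel expenses are reimbursed after manager approval.",
]
question = "When must contractors complete security training?"
print(build_prompt(question, retrieve(question, build_index(docs), k=1)))
```

A production version would swap `embed` for a model call and add the rerank stage between `retrieve` and `build_prompt`; the shape of the loop stays the same.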

Where naive RAG breaks

The loop looks clean on a whiteboard. In production, three failure modes eat it alive.

Chunking severs meaning. A fixed-size chunker doesn't know where ideas begin and end. It splits on token count, not on semantic boundaries. A policy document that says "Contractors must complete security training within 30 days of onboarding" can get split so that one chunk says contractors need training and the next chunk opens with "within 30 days of onboarding" and no indication of what the deadline applies to. Neither chunk contains the complete requirement. Research confirms that optimal chunk size is query-dependent, and "getting it right" means different things for different queries against the same corpus.

Similarity search returns the wrong "close enough." Vector similarity measures conceptual proximity, which turns out to be a weaker guarantee than actual relevance. A query about "access revocation procedures" might retrieve chunks about "access provisioning procedures" because the vectors are neighbors in embedding space. They share the concept of access management. They answer completely different questions. This is the embedding similarity miss, and it's particularly vicious because the system has no signal that it happened. The retrieved chunks look relevant to the model. The model generates a confident, well-structured, wrong answer. Nobody in the pipeline raises a hand.

Single-shot retrieval has no error recovery. Naive RAG retrieves once and generates. If the retrieval step missed the right chunks, there's no second pass, no mechanism to say "that doesn't look right, let me search again with different terms." One practitioner blog estimates a ~40% miss rate in production, though no methodology is published for that number and it should be treated as directional, not precise. Even if the real figure is half that, it names a real problem.

Anthropic's research on contextual retrieval, which prepends explanatory context to each chunk before embedding, reduced retrieval failure rates by 35 to 49% depending on configuration. That improvement tells you the baseline failure rate was high enough to justify the engineering investment.

Naive RAG failures: Fixed-size chunking severs meaning at arbitrary boundaries, embedding similarity returns conceptually adjacent but wrong results, and single-shot retrieval has no mechanism to detect or recover from a miss.
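The first failure mode is easy to reproduce. The sketch below runs the policy sentence from the example above through a fixed-size splitter, then through a sentence-boundary splitter for comparison. It assumes words as a stand-in for tokens and a deliberately tiny chunk size so the effect shows up on one sentence; both function names are illustrative, and the regex splitter is a simplification, not a recommendation.

```python
import re

POLICY = "Contractors must complete security training within 30 days of onboarding."


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split on word count alone, ignoring where sentences begin and end."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def sentence_chunks(text: str) -> list[str]:
    """Split only after sentence-ending punctuation, so a requirement stays whole."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


for piece in fixed_size_chunks(POLICY, size=5):
    # Report whether this chunk still carries both halves of the requirement.
    complete = "training" in piece and "30 days" in piece
    print(f"{piece!r} complete={complete}")

print(sentence_chunks(POLICY))
```

With a chunk size of five words, the obligation and the deadline land in different chunks, so a query about the training deadline can retrieve either half without the other.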

Agentic retrieval and the retry loop

Agentic retrieval adds one capability that naive RAG lacks: it can look at what it found and decide to search again.

An orchestrating agent examines initial results, evaluates whether they actually answer the question, and can reformulate the query, search a different index, or request additional context before the language model ever sees the chunks. Mechanically: an analyst asks about access revocation procedures for departing contractors. The first retrieval pass returns chunks about access provisioning, onboarding workflows, contractor badge issuance. The agent evaluates: these are about granting access, and the question was about removing it. It strips the broad "access management" framing, reformulates the query around "revocation" and "offboarding" specifically, and retries. The second pass returns the deprovisioning policy, the 24-hour revocation SLA, the ISSO notification requirement. Those chunks go to the model.

The cost is latency and compute. Each retry is another embedding, another search, another evaluation pass. The benefit is something resembling self-correction, at least for the retrieval step. The generation step still has no ground truth to check against. That's a separate problem entirely.

Agentic retrieval: An orchestrating agent evaluates retrieval results and can reformulate queries, search additional indexes, or retry, adding error recovery that naive single-shot RAG lacks.
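A minimal sketch of that retry loop, using the access-revocation example. Everything here is a hypothetical stand-in: `keyword_search` replaces the vector search, `judge` replaces the LLM call that decides whether a chunk answers the question, and `reformulate` replaces the agent's query rewriting with a hard-coded rule. Only the control flow in `agentic_retrieve` is the point.

```python
import re
from typing import Callable

CORPUS = [
    "Access provisioning: new contractors receive badges during onboarding.",
    "Access provisioning workflow for contractor accounts.",
    "Offboarding: revocation of contractor credentials within 24 hours; notify the ISSO.",
]
REMOVAL_TERMS = {"revocation", "offboarding", "deprovisioning"}
BROAD_TERMS = {"access", "procedures", "for"}


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(query: str, k: int = 2) -> list[str]:
    """Stand-in for vector search: rank chunks by shared words with the query."""
    q = tokens(query)
    return sorted(CORPUS, key=lambda c: len(q & tokens(c)), reverse=True)[:k]


def judge(question: str, chunk: str) -> bool:
    """Stand-in for an LLM relevance check: is this chunk about removing access?"""
    return bool(REMOVAL_TERMS & tokens(chunk))


def reformulate(query: str) -> str:
    """Stand-in for query rewriting: drop the broad framing, add a specific term."""
    kept = [w for w in query.split() if w.lower() not in BROAD_TERMS]
    return " ".join(kept + ["offboarding"])


def agentic_retrieve(
    question: str,
    search: Callable[[str], list[str]],
    is_relevant: Callable[[str, str], bool],
    rewrite: Callable[[str], str],
    max_attempts: int = 3,
) -> tuple[list[str], list[str]]:
    """Search, evaluate, and retry with a rewritten query until something passes."""
    tried: list[str] = []
    current = question
    for _ in range(max_attempts):
        tried.append(current)
        relevant = [c for c in search(current) if is_relevant(question, c)]
        if relevant:
            return relevant, tried
        current = rewrite(current)
    return [], tried  # nothing passed: better to report a miss than to generate


chunks, queries = agentic_retrieve(
    "access revocation procedures for departing contractors",
    keyword_search, judge, reformulate,
)
print(queries)
print(chunks)
```

The `max_attempts` cap is where the latency and compute cost gets bounded: every extra pass is another search plus another evaluation call.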

The filesystem-access shift

A newer pattern is starting to surface in architecture conversations, and it abandons the vector store entirely. Tools like Claude Code read source files directly from the filesystem at runtime instead of querying a pre-built index.

An arXiv analysis of Claude Code describes the mechanism precisely:

"The system does not use embeddings or a vector similarity index for memory retrieval; instead it uses an LLM-based scan of memory-file headers to select up to five relevant files on demand."

Every piece of context the agent sees is a plain-text file a human can read, edit, and version-control. No opaque retrieval layer deciding what's relevant. The design principle is explicit: stored context should be inspectable and editable by the user. That's a deliberate trade against the flexibility of embedding-based lookup.
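The shape of that header scan can be sketched as follows. This is not Claude Code's implementation: it assumes each memory file is a `.md` file whose first line serves as its header, and it uses a word-overlap score as a hypothetical stand-in for the LLM that judges relevance. Only the cap of five files comes from the quoted description.

```python
import re
from pathlib import Path
from typing import Callable

MAX_FILES = 5  # "up to five relevant files", per the quoted analysis


def read_header(path: Path) -> str:
    """Read only the first line of a file (assumed here to be its header)."""
    with path.open(encoding="utf-8") as handle:
        return handle.readline().strip()


def keyword_score(query: str, header: str) -> int:
    """Stand-in for the LLM relevance judgment: count of shared words."""
    words = lambda text: set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(words(query) & words(header))


def select_files(
    query: str,
    memory_dir: Path,
    score: Callable[[str, str], int] = keyword_score,
    limit: int = MAX_FILES,
) -> list[Path]:
    """Scan headers only, then return at most `limit` files worth reading in full."""
    scored = [(score(query, read_header(p)), p) for p in sorted(memory_dir.glob("*.md"))]
    relevant = [(s, p) for s, p in scored if s > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in relevant[:limit]]
```

Because selection works on plain files, a user can see exactly why a file was or wasn't picked by reading its header, which is the inspectability trade the pattern is making.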

Anthropic's own documentation draws the boundary: for knowledge bases under roughly 500 pages, they recommend loading material directly into the context window and skipping retrieval entirely. Above that, you need RAG or something like it.

Dimension	Vector-Store RAG	Direct File Access
Scale	Millions of documents	Bounded by context window
Inspectability	Low. Retrieval is opaque.	High. Files are readable.
Infrastructure	Indexing pipeline + vector DB	Files on disk
Staleness	Requires re-embedding on updates	No index to go stale
Authorization	Must be built explicitly	OS-level file permissions

That last row matters for identity. It also gets complicated fast: OS-level file permissions work for a developer on a local machine, which is Claude Code's primary context. In a multi-user federal environment where hundreds of analysts need differentiated access to the same document corpus, "OS-level permissions" is not an authorization architecture. The authorization story for filesystem-access patterns at enterprise scale is genuinely unsettled. Nobody has a clean answer yet.

Filesystem-access pattern: Agents read source files directly at runtime instead of querying pre-indexed vector stores, trading scale for inspectability and eliminating the embedding pipeline. The authorization model for this pattern in enterprise environments remains an open question.

Okta Concept Mapping: RAG Retrieval as an Access Control Problem

Your OAuth intuition helps here: RAG retrieval is a system accessing resources on behalf of a user. The analogy breaks because OAuth scopes are enforced at the resource server with a known requestor identity, while a vector database has no native concept of "who is asking." RAG authorization must be enforced before the embedding search executes, and must stay synchronized with source-system ACLs that change dynamically.

The permission logic that lived in SharePoint or Confluence doesn't travel with the chunk into the embedding index. If a user with limited access asks a question, naive RAG will retrieve and surface documents they were never supposed to see, wrapped in a confident answer that feels completely legitimate. AWS's security guidance states it plainly: LLMs should be considered untrusted entities because they do not implement authorization as part of a response. The production-safe architecture requires authorization filtering before the similarity search runs. The OWASP 2025 Top 10 for LLM Applications added Vector and Embedding Weaknesses as a new category specifically because of this failure mode.
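A minimal sketch of authorization filtering ahead of the similarity search. It assumes each chunk carries an `allowed_groups` set copied from the source system's ACL at index time, and that the caller's group membership is already known; both are assumptions, as are all names here. Word overlap stands in for vector similarity, and keeping the copied ACLs synchronized with the source system is not shown.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_groups: frozenset[str]  # copied from the source ACL at index time


def overlap(query: str, text: str) -> int:
    """Stand-in for vector similarity: count of shared words."""
    words = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    return len(words(query) & words(text))


def authorized_search(
    query: str, user_groups: frozenset[str], index: list[Chunk], k: int = 3
) -> list[Chunk]:
    """Drop chunks the user cannot see first, then rank only what remains."""
    visible = [c for c in index if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: overlap(query, c.text), reverse=True)[:k]


INDEX = [
    Chunk("Incident response plan for the CUI enclave.", frozenset({"cui-cleared"})),
    Chunk("General incident reporting hotline and hours.",
          frozenset({"all-staff", "cui-cleared"})),
]

for groups in (frozenset({"all-staff"}), frozenset({"cui-cleared"})):
    hits = authorized_search("incident response plan", groups, INDEX)
    print(sorted(groups), [c.text for c in hits])
```

Filtering before ranking means a restricted chunk never reaches the prompt, so the model has nothing to leak; filtering after generation would be too late.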

When you'll need this

You're in a meeting with a federal CAIO or CISO. They're describing a knowledge management system that lets analysts query policy documents, internal memos, and case files using natural language. They call it "AI-powered search" or "intelligent document retrieval." What they're describing is RAG.

Three questions that matter in that room:

"How does the system handle documents with different access levels?" This is the question that stalls deals. If CUI and unclassified documents are embedded in the same vector store without authorization-aware retrieval, the system will surface controlled information to unauthorized users. Federal agencies may need paragraph-level classification rather than document-level, and no standard authorization framework handles that natively today. This is a genuine open problem. Saying so is more credible than pretending it's solved.

"Who owns the RAG index?" GSA's draft AI clause explicitly asserts government ownership of RAG indexes, prompt libraries, and integration scripts built under contract. If your buyer is building a RAG pipeline on agency data, the index itself is a data sovereignty question. Most sellers haven't internalized this yet.

"What happens when the source documents change?" In a vector-store RAG system, updated documents need to be re-chunked and re-embedded. Until they are, the system retrieves stale content and generates answers based on outdated policy. The filesystem-access pattern sidesteps this but only works at smaller scale. Ask the buyer what their re-indexing cadence is. If they don't have one, that's your opening.

You don't need to build RAG systems. But recognizing these as the right questions, and knowing that the buyer asking them is technically serious, is the difference between following the conversation and losing it.

Practical context: In public sector conversations, RAG surfaces as "AI-powered search" over policy documents and case files. The deal-critical questions are about access control at the retrieval layer, data sovereignty over the index, and staleness when source documents change.

Next in this series: how fine-tuning differs from RAG, and why the choice between them is an architectural decision with real tradeoffs.

Things to follow up on...

Contextual retrieval mechanics: Anthropic's research on prepending chunk-specific context before embedding reduced retrieval failure rates by up to 49%, and the full writeup walks through the cost and configuration tradeoffs worth understanding before your next architecture conversation.
Federal RAG security primer: Federal News Network published a May 2026 piece that maps each stage of a RAG pipeline to a specific data exposure risk, from ingestion through generation, with federal-specific framing your buyers will recognize.
GSA's draft AI clause: The upcoming MAS refresh includes language asserting government ownership of RAG indexes and prompt libraries built under contract, and Burr & Forman's analysis breaks down what that means for vendors building on agency data.
OWASP vector weaknesses: The 2025 OWASP Top 10 for LLM Applications added Vector and Embedding Weaknesses as a new category, and Protecto's breakdown of the failure modes connects them directly to the access-control-bypass patterns that show up in enterprise RAG deployments.