Buyers building AI systems are making one of two bets about their content. Either it's stable enough to index — chunk it, embed it, store it, retrieve by similarity when a query comes in — or it moves too fast to index reliably, so you skip the index and read the current truth on every query. The first pattern is RAG. The second is what the agentic-coding community calls direct retrieval, and it's showing up in your calls whether or not buyers name it that way. Knowing which pattern a buyer has built tells you what failure mode to expect when something goes wrong.
The Two Patterns
RAG — Retrieval-Augmented Generation
What it is: A retrieval pattern that indexes your knowledge base ahead of time and queries the index repeatedly.
What it does: Documents get chunked into pieces, each chunk gets embedded into a vector representation, and those vectors get stored in a vector database. When a query comes in, the system retrieves the chunks most semantically similar to the question and passes them to the model as context. The model answers from that retrieved context, not from its training weights alone.
Where it comes from: The pattern was formalized in a 2020 paper from Facebook AI Research — Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" — though the underlying intuition predates it: give the model relevant context at query time rather than trying to bake everything into weights. Enterprise AI vendors picked it up fast because it solved a real problem: how do you give a model access to proprietary knowledge without retraining it?
What makes it distinct: The index is the bet. You're betting that the cost of building and maintaining it is worth paying because you'll query it enough times to amortize. When that bet pays off, RAG is efficient and semantically capable. When it doesn't, you've built an expensive liability.
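The chunk, embed, store, retrieve sequence described above can be sketched in a few lines. This is a toy under stated assumptions, not a production pipeline: `embed` is a bag-of-words counter standing in for a real embedding model (so it only captures word overlap, not meaning), `ToyIndex` is an in-memory list standing in for a vector database, and the two documents are invented for illustration.

```python
import math
import re
from collections import Counter

def chunk(text: str, size: int = 12) -> list[str]:
    """Split text into fixed-size word windows (real pipelines chunk more carefully)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyIndex:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.rows: list[tuple[Counter, str]] = []

    def add_document(self, text: str) -> None:
        # Indexing step: chunk once, embed once, store.
        for piece in chunk(text):
            self.rows.append((embed(piece), piece))

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        # Query step: embed the question, rank stored chunks by similarity.
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [piece for _, piece in ranked[:k]]

index = ToyIndex()
index.add_document("Contractors need manager approval before accessing production systems.")
index.add_document("Expense reports are due on the fifth business day of each month.")

# The retrieved chunk is what would be handed to the model as context.
context = index.retrieve("who approves production access for contractors")
```

Note where the cost sits: `add_document` runs once per document, `retrieve` runs once per query. That split is the amortization bet in miniature.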
Direct Retrieval
What it is: A retrieval pattern that reads the current state of your content on every query, with no intermediate index.
What it does: grep, filesystem reads, live API calls — whatever gets the model to the current truth without going through an index. The model asks "what does this file say right now" instead of "what did this file say when I last indexed it." There's no embedding step, no vector store, no similarity search. You get exact matches and current state. You don't get semantic search, and you pay the read cost every time.
Where it comes from: This is how search worked before inverted indexes, and it's how most shell scripting still works. What's new is applying it deliberately in agentic AI systems as an architectural choice, not a limitation. The agentic-coding tools made this explicit: Anthropic's Claude Code documentation describes the tool as reading files directly rather than maintaining an index, because codebases change continuously and a stale index gives the model confident wrong answers about current code state.
What makes it distinct: No index means no staleness. It also means no semantic search, no amortization, and no free lunch — you pay the read cost on every single query. The tradeoff is explicit and intentional.
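A minimal sketch of the same idea, assuming a folder of text files as the content source. The `grep` helper here is a hypothetical Python stand-in for the shell tool, and the policy file and its contents are invented. The point is that the second query sees the edit with no reindex step in between.

```python
import tempfile
from pathlib import Path

def grep(root: Path, needle: str) -> list[tuple[str, int, str]]:
    """Scan every .txt file under root right now; return (file name, line number, line)."""
    hits = []
    for path in sorted(root.rglob("*.txt")):
        for number, line in enumerate(path.read_text().splitlines(), start=1):
            if needle in line:  # exact substring match, no semantics
                hits.append((path.name, number, line))
    return hits

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    policy = root / "policy.txt"

    policy.write_text("Session timeout: 30 minutes\n")
    before = grep(root, "timeout")

    # The content changes. There is nothing to rebuild.
    policy.write_text("Session timeout: 15 minutes\n")
    after = grep(root, "timeout")
```

Every call to `grep` pays the full scan cost again, which is the other half of the tradeoff.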
Okta analog: Universal Directory as a stable attribute store. RAG's fitness condition maps cleanly to how Okta's Universal Directory works in practice. The directory is a stable, repeatedly queried knowledge base of user attributes, group memberships, and application assignments. Policy evaluation reads from it constantly, and the read cost amortizes across millions of authentications. Where it breaks down: when group membership or entitlements change faster than the cache refreshes, you get the same failure mode as a stale RAG index. The policy engine, like the model, answers confidently from data that no longer reflects reality.
Comparison: One Variable Does Most of the Work
The comparison below is organized around a single variable: content velocity relative to query volume. That variable determines which pattern wins, and each dimension that follows covers both patterns.
Content velocity
RAG assumes your content is stable relative to your query volume. If you're indexing a policy library, a product knowledge base, or a document archive — things that change on a weekly or monthly cadence — the index stays accurate long enough to be useful. The indexing cost is a fixed investment that pays down over time.
Direct retrieval assumes the opposite. Codebases are the canonical example because they make the failure mode vivid: a developer modifies a function, and if the RAG index hasn't been rebuilt since that change, the model will describe the old function with full confidence. It's not hallucinating — it's accurately describing a document that no longer exists. That's a harder failure to debug than a hallucination, because the retrieved chunk is real. It just describes the past.
Claude Code's architectural choice was explicit about this reasoning. The tool reads files directly because the alternative — maintaining an index over an actively-developed codebase — produces a system that's confidently wrong about current state. The read cost on every query is the price of accuracy.
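The stale-index failure is easy to reproduce in miniature. The sketch below assumes a one-file "codebase" held in a dictionary and an index that stores a snapshot plus a content hash. The file name, the function bodies, and the hash-based staleness check are all illustrative choices, not a description of any particular tool.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Content hash recorded at indexing time so drift can be detected later."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical live codebase: file name -> current contents.
source = {"billing.py": "def charge(amount): return amount * 1.05"}

# Index built once: a snapshot of each file plus its hash at indexing time.
index = {
    name: {"text": text, "sha": fingerprint(text)}
    for name, text in source.items()
}

# A developer changes the function after the index was built.
source["billing.py"] = "def charge(amount, tax_rate): return amount * (1 + tax_rate)"

def answer_from_index(name: str) -> str:
    """What a retrieval layer backed by the old index would hand the model."""
    return index[name]["text"]

def stale_files() -> list[str]:
    """Files whose current contents no longer match what was indexed."""
    return [
        name for name, entry in index.items()
        if fingerprint(source.get(name, "")) != entry["sha"]
    ]

indexed_view = answer_from_index("billing.py")  # real text, but from the past
current_view = source["billing.py"]             # what a direct read returns
needs_reindex = stale_files()
```

`indexed_view` is not made up. It is an accurate copy of a function that no longer exists, which is why this failure is harder to spot than a hallucination.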
Query volume and amortization
RAG's economics only work if you query the index enough times to justify building it. For a large enterprise knowledge base queried by hundreds of users daily, the math is obvious. For a small, frequently-updated dataset queried occasionally, you're paying index-maintenance overhead for a marginal retrieval benefit.
Direct retrieval has no amortization curve. The cost per query is roughly constant. That's a disadvantage at high query volume against stable content, and an advantage when the alternative is maintaining an index that's perpetually stale.
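The amortization argument reduces to simple arithmetic. As a rough model, assume the index has to be rebuilt each time the content changes, and that an indexed query is cheaper than a direct read. The index then pays off only if the number of queries served per rebuild exceeds the rebuild cost divided by the per-query saving. The cost figures below are arbitrary units chosen for illustration, not measurements.

```python
import math

def break_even_queries(index_cost: float, rag_query_cost: float,
                       direct_query_cost: float) -> float:
    """Queries needed before one index build pays for itself; inf if it never does."""
    saving_per_query = direct_query_cost - rag_query_cost
    if saving_per_query <= 0:
        return math.inf
    return index_cost / saving_per_query

def index_pays_off(queries_per_rebuild: int, index_cost: float,
                   rag_query_cost: float, direct_query_cost: float) -> bool:
    """True when the queries served between rebuilds exceed the break-even point."""
    needed = break_even_queries(index_cost, rag_query_cost, direct_query_cost)
    return queries_per_rebuild > needed

# Stable knowledge base: many queries between content changes.
stable_kb = index_pays_off(queries_per_rebuild=10_000, index_cost=1_000,
                           rag_query_cost=1, direct_query_cost=5)

# Fast-moving content: the index is rebuilt after only a handful of queries.
fast_moving = index_pays_off(queries_per_rebuild=50, index_cost=1_000,
                             rag_query_cost=1, direct_query_cost=5)
```

With these assumed numbers the break-even point is 250 queries per rebuild, so the stable case clears it easily and the fast-moving case never gets there.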
Semantic capability
RAG retrieves by semantic similarity. You can ask "what's our policy on contractor access to production systems" and get relevant chunks even if no document uses exactly those words. This is useful for knowledge-base queries where the question and the answer use different vocabulary.
Direct retrieval gives you what you ask for, literally. grep finds exact matches. Filesystem reads return file contents. If the relevant information uses different terminology than the query, you miss it. Some agentic systems compensate by having the model generate multiple search terms and run several reads — but that's complexity added back in to recover a capability that RAG provides natively.
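The multi-term workaround looks roughly like this. In a real agent the model would propose the alternative terms; here a hard-coded list stands in for that step, and the two policy documents are invented for illustration.

```python
# Hypothetical document store: file name -> contents.
documents = {
    "access-policy.txt": "Vendors must complete security training before production access.",
    "travel-policy.txt": "Book flights at least two weeks in advance.",
}

def search(term: str) -> set[str]:
    """Literal, case-insensitive substring match, similar in spirit to grep -i."""
    return {name for name, text in documents.items() if term.lower() in text.lower()}

def search_any(terms: list[str]) -> set[str]:
    """Run one literal search per term and merge the hits."""
    hits: set[str] = set()
    for term in terms:
        hits |= search(term)
    return hits

# A single literal query misses the document that says "vendors".
literal = search("contractor")

# Stand-in for model-generated alternatives to the original term.
expanded = search_any(["contractor", "vendor", "third party"])
```

The expanded search finds the policy, but only because someone (or some model) thought to try "vendor". Coverage depends on the quality of the term list, and each extra term is another read.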
Latency and scale
RAG's latency profile is predictable: vector similarity search is fast, and with an approximate nearest-neighbor index, retrieval time grows much more slowly than corpus size, unlike a full-text scan. You can have a million-document knowledge base and still get sub-second retrieval.
Direct retrieval's latency scales with corpus size and query complexity. Reading a handful of files is fast. Reading a large codebase on every query is not. There's genuine disagreement in the engineering community about where the ceiling is — whether direct retrieval scales past, say, a 500,000-line codebase depends heavily on how the agent structures its reads and how much of the corpus it actually needs to touch per query. The honest answer is that the production data on this is thin, and anyone claiming a definitive number is probably extrapolating from benchmarks.
Okta analog: live LDAP queries at authentication time. Direct retrieval maps to how some legacy environments still handle group membership lookups: a live LDAP query at authentication time, not a cached attribute assertion. The accuracy argument is identical: you get current group membership, not what was true when the session token was issued. The scale argument follows: this works fine at moderate volume and becomes a latency problem at enterprise scale, because every authentication waits on a directory round trip. Caching solves the scale problem and introduces a freshness problem. That's the same architectural evolution and the same tradeoff, in different vocabulary.
Failure Modes
Buyers often don't know which architecture they've built until something breaks. The failure modes look different, and knowing which one you're looking at changes the conversation.
RAG failures
Stale index. The most common RAG failure. Content changes after the last reindex, and the model answers from outdated chunks. From the outside, this looks like hallucination — the model says something confidently wrong. The diagnostic question: when did you last reindex, and what changed since then? If the answer is "we reindex weekly" and the content changes daily, you've found the problem.
Chunk boundary artifacts. The relevant information spans a chunk boundary. Neither chunk has enough context to answer the question. The model says "I don't have information about that" when the information is clearly in the knowledge base. This is a chunking strategy problem, not a model problem, but it presents as a model problem.
Retrieval misses on exact-match queries. Semantic similarity search is optimized for conceptual proximity, not exact string matching. If a user needs to find a specific policy number, a contract clause with precise wording, or a configuration value, RAG may retrieve semantically adjacent chunks that don't contain the exact string. Hybrid search approaches address this, but that's a separate conversation.
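The chunk boundary problem can be shown with a two-sentence policy. The sketch below assumes naive fixed-size word chunking, then applies overlapping windows, which is one common mitigation and not necessarily what any given pipeline uses. The policy text and chunk sizes are invented to make the split land in an awkward place.

```python
def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Fixed-size word windows; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def some_chunk_has_all(chunks: list[str], terms: list[str]) -> bool:
    """True if at least one chunk contains every term, i.e. enough context to answer."""
    return any(all(term in piece for term in terms) for piece in chunks)

policy = ("Contractors may access staging freely. "
          "Production access requires a signed exception from the security team.")

# Answering "what does production access require?" needs both terms in one chunk.
needed = ["Production", "exception"]

plain = chunk_words(policy, size=8)                  # boundary falls mid-sentence
overlapped = chunk_words(policy, size=8, overlap=4)  # windows straddle the boundary

split_apart = not some_chunk_has_all(plain, needed)
recovered = some_chunk_has_all(overlapped, needed)
```

Without overlap, "Production" and "exception" land in different chunks and neither can answer the question alone. Overlap fixes this case at the cost of storing and searching more chunks.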
Direct retrieval failures
Scale ceiling. Reading files on every query works until it doesn't. Latency spikes, context windows fill up, and the agent starts making selective decisions about what to read — which means it starts missing things. The failure is gradual and hard to detect until query quality degrades noticeably.
Exact-match blindness. grep doesn't know that "contractor" and "vendor" might refer to the same thing in your policy documents. Direct retrieval misses anything that requires semantic understanding to find. For codebases, this is usually fine — you know what you're looking for. For knowledge bases with inconsistent terminology, it's a real gap.
Context window pressure. Pulling raw file contents into the model's context is expensive. A well-designed RAG pipeline retrieves only the relevant chunks; direct retrieval tends to pull whole files or large match sets and lets the model sort it out. At scale, this creates pressure on context window limits and increases inference cost.
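The scale ceiling and context pressure can be sketched together as a read budget. This assumes a crude word count as a proxy for tokens and an agent that reads files in order until the budget runs out. Real agents choose what to read more cleverly, and the file names, sizes, and budget here are made up.

```python
def rough_tokens(text: str) -> int:
    """Crude proxy: one whitespace-separated word counts as one token."""
    return len(text.split())

def read_within_budget(files: dict[str, str],
                       budget_tokens: int) -> tuple[list[str], list[str]]:
    """Read files in order until the budget is spent; return (read, skipped)."""
    read: list[str] = []
    skipped: list[str] = []
    used = 0
    for name, text in files.items():
        cost = rough_tokens(text)
        if used + cost <= budget_tokens:
            read.append(name)
            used += cost
        else:
            skipped.append(name)  # silently missing from the model's view
    return read, skipped

# Hypothetical codebase: three files of 400 "tokens" each.
files = {
    "auth.py": "token " * 400,
    "billing.py": "charge " * 400,
    "audit.py": "log " * 400,
}

read, skipped = read_within_budget(files, budget_tokens=1_000)
```

Nothing errors out when `audit.py` is skipped. The agent just answers without it, which is why this failure shows up as gradually worse answers instead of a visible break.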
Okta analog: token claims vs. live entitlement checks. A JWT with embedded group claims is RAG: entitlements were indexed at token issuance, the token gets queried repeatedly until expiry, and the application answers from the claims in the token. A live entitlement check at every authorization decision is direct retrieval: current truth, every time, at the cost of a round trip to the authorization server. The tradeoff is freshness versus efficiency, and the failure mode is stale claims authorizing actions the user is no longer entitled to take. SIEM teams in federal environments know this failure mode well. The AI architecture conversation is the same problem with different vocabulary.
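The token analogy can be made concrete with a small sketch. It does not use a real JWT library or any Okta API: `Token` is a plain dataclass standing in for a signed token, `directory` is a dictionary standing in for the live entitlement source, and the user, groups, and timestamps are invented.

```python
from dataclasses import dataclass

# Hypothetical live source of truth: user -> current group memberships.
directory: dict[str, set[str]] = {"alice": {"admins", "engineering"}}

@dataclass(frozen=True)
class Token:
    """Stand-in for a signed token: a frozen snapshot of claims with an expiry."""
    subject: str
    groups: frozenset[str]
    expires_at: int

def issue_token(user: str, now: int, ttl: int = 3600) -> Token:
    # "Indexing" step: copy the user's groups into the token at issuance.
    return Token(subject=user, groups=frozenset(directory[user]), expires_at=now + ttl)

def allowed_by_claims(token: Token, group: str, now: int) -> bool:
    """Answer from the snapshot, with no round trip to the directory."""
    return now < token.expires_at and group in token.groups

def allowed_by_live_check(user: str, group: str) -> bool:
    """Answer from the directory's current state on every decision."""
    return group in directory.get(user, set())

token = issue_token("alice", now=0)

# The entitlement is revoked ten minutes into the token's lifetime.
directory["alice"].discard("admins")

claims_say = allowed_by_claims(token, "admins", now=600)
live_says = allowed_by_live_check("alice", "admins")
```

Until the token expires, the claims-based check keeps granting admin access that the directory has already revoked. Shortening `ttl` narrows that window, the same way reindexing more often narrows a stale index.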
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "The AI is hallucinating" | "This looks like a stale index problem — when did they last reindex?" | Points to a fixable architecture issue, not a model deficiency |
| "RAG is the standard approach" | "RAG pays off when the content is stable and queried repeatedly" | Frames it as a fitness condition, not a default |
| "They're using vector search" | "They've built a RAG pipeline — vector search is one component of it" | Vector search is a part; RAG is the pattern |
| "Direct retrieval doesn't scale" | "Direct retrieval trades scalability for freshness — the question is which one they need more" | Accurate tradeoff framing instead of a dismissal |
| "Their AI reads the files" | "They're using direct retrieval — no index, current state on every query" | Names the pattern and implies the tradeoffs |
| "When did they last train the model" | "When did they last reindex" | Training and indexing are different operations; conflating them signals confusion |
| "The model doesn't know about recent changes" | "Their retrieval layer hasn't caught up — the model weights aren't the issue" | Localizes the problem correctly |
| "Agentic tools use a different approach" | "Agentic coding tools like Claude Code skip the index by design — it's a deliberate choice for fast-moving content" | Names the deliberate architectural decision, not just a difference |
| "They need to update the AI" | "They need to reindex — or reconsider whether RAG is the right pattern for this content" | Precision; opens the fitness-condition conversation |
| "RAG is better for enterprise" | "RAG is better when the content is stable; direct retrieval is better when it moves faster than you can reindex" | The only accurate framing of the tradeoff |
The question a buyer is really answering when they choose between these patterns is: how often does my content change relative to how often it gets queried? Rarely changes, queried constantly — RAG is right. Changes constantly and current truth matters — the index is a liability. Most enterprise knowledge bases live clearly in the first camp. Most codebases live clearly in the second. The interesting cases are in the middle — compliance documents that update quarterly, security policies that change after incidents, configuration data that drifts — and those are the cases where the buyer probably hasn't thought hard about which bet they're making.
When they haven't thought about it, the failure mode is usually the same: they built RAG because RAG is what you build, and they're debugging what looks like hallucination but is actually a freshness problem. Worth being ready for that conversation.

