"Zero Data Retention" Is a Contract, Not an Architecture

By Carey Whitten, May 5, 2026

The gap between what a vendor promises on paper and what its architecture actually does is why Sensitive Information Disclosure ranks second in the 2025 OWASP LLM Top 10. The risk itself isn't new; what's changed is that enterprise buyers are applying contractual instruments to a problem with technical dimensions those instruments don't reach.

What the Risk Category Actually Covers

Sensitive Information Disclosure, as OWASP defines it, is the class of failures in which an LLM system surfaces data (PII, proprietary information, credentials, confidential communications) to a party not authorized to receive it. The disclosure can flow toward external users, across tenant boundaries, or back to the vendor's infrastructure. The mechanism is not a model hallucinating sensitive content from thin air. It is data that entered the system for one purpose and then persists in a form that allows retrieval for another. Three architectural pathways produce this outcome, and they operate independently.

The Three Pathways

Chat logs are the most legible pathway. A user submits a query containing a Social Security number, a contract term, or a patient identifier. The platform logs the conversation — for abuse detection, quality review, or debugging. The model processes the query and returns a response. The log persists in vendor infrastructure. The user's intent was to get an answer; the data's actual trajectory included a write operation to a logging system the user never saw. Contractual no-training clauses address this pathway directly, and retention windows constrain it further. It is also the pathway most amenable to prompt-boundary DLP, which can intercept sensitive content before it reaches the model at all.
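A minimal sketch of what prompt-boundary DLP can look like, assuming a regex-only filter that runs before the prompt leaves the customer's side. The pattern set, the placeholder format, and the function name `redact` are illustrative assumptions, not any product's API, and two patterns are nowhere near a complete detector.

```python
import re

# Illustrative patterns only (assumption): a real filter needs a much broader set.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def redact(prompt: str) -> tuple[str, list[str]]:
    """Replace matches with typed placeholders.

    Returns the redacted prompt and the labels of the patterns that fired.
    """
    findings = []
    for label, pattern in PATTERNS.items():
        prompt, count = pattern.subn(f"[{label}_REDACTED]", prompt)
        if count:
            findings.append(label)
    return prompt, findings


if __name__ == "__main__":
    safe_prompt, findings = redact("My SSN is 123-45-6789, can you check my claim?")
    print(safe_prompt)  # the text that would be sent to the model and logged
    print(findings)     # which pattern labels matched
```

Because the redaction happens before the request is sent, whatever the vendor logs contains the placeholder rather than the identifier. A prompt that is sensitive only in context, with no pattern to match, passes through this kind of filter unchanged.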

Embeddings are where the "just numbers" intuition fails. When an enterprise deploys retrieval-augmented generation against internal documents — HR policies, contracts, customer records — those documents are chunked and encoded as high-dimensional vectors stored in a vector database. The common assumption is that this transformation is one-way: the source text is gone, replaced by floating-point arrays that carry semantic meaning but not recoverable content. Research published by Morris et al. in 2023 demonstrated that this assumption is wrong for practical embedding models. Source text can be reconstructed from embeddings with meaningful fidelity, particularly for shorter segments. A vector store containing embeddings of sensitive documents is a sensitive data store. It looks like a database of numbers; it functions like a compressed archive of the source material. Standard DLP controls don't inspect it. Retention windows often don't apply to it. Tenant isolation is the primary control, and its effectiveness depends entirely on whether the vector store enforces strict namespace separation with access controls — not just logical partitioning that a misconfiguration can collapse.
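To make the isolation point concrete, here is a toy in-memory sketch in which every tenant gets its own index and a query can only ever search the caller's index. The class `TenantScopedVectorStore`, its methods, and the tenant and chunk names are hypothetical, and the sketch assumes `tenant_id` comes from the authenticated session rather than from the request body. Real vector databases expose their own isolation mechanisms, which this does not model.

```python
import math


def _cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TenantScopedVectorStore:
    """Toy store with one separate index per tenant (hypothetical design)."""

    def __init__(self) -> None:
        # tenant_id -> list of (embedding, chunk_id)
        self._indexes: dict[str, list[tuple[list[float], str]]] = {}

    def add(self, tenant_id: str, vector: list[float], chunk_id: str) -> None:
        self._indexes.setdefault(tenant_id, []).append((vector, chunk_id))

    def query(self, tenant_id: str, vector: list[float], k: int = 3) -> list[str]:
        # Assumption: tenant_id is taken from the authenticated session.
        # Only the caller's own index is searched; there is no shared index
        # that a missing filter clause could expose.
        index = self._indexes.get(tenant_id, [])
        ranked = sorted(index, key=lambda item: _cosine(item[0], vector), reverse=True)
        return [chunk_id for _, chunk_id in ranked[:k]]


if __name__ == "__main__":
    store = TenantScopedVectorStore()
    store.add("acme", [1.0, 0.0], "acme-hr-policy-001")
    store.add("globex", [1.0, 0.0], "globex-contract-007")
    print(store.query("acme", [1.0, 0.0]))
```

The design choice worth noticing is that isolation comes from the data layout, not from a filter applied at query time. A single shared index with a `tenant_id` filter is the kind of logical partitioning that one omitted clause can collapse.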

Fine-tuning datasets present the hardest problem. When an enterprise fine-tunes a hosted model on proprietary data, that data is encoded into the model's weights. Unlike a database record, it has no address, no row identifier, no deletion API. The data is distributed across billions of parameters as a statistical residue of the training process, with no discrete location to target. Membership inference attacks can determine with meaningful confidence whether a specific record was in the training set. Extraction attacks can sometimes elicit verbatim training data through crafted prompts. The GDPR right-to-erasure obligation, applied to a fine-tuned model, has no clean technical implementation. You cannot delete a row from a model. The only reliable control is not fine-tuning on sensitive data in the first place — which is a design constraint, not a runtime control.
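One simple form of membership inference is a loss-threshold test: a model tends to assign unusually low loss to records it was trained on, so a record whose loss falls well below the typical loss on known non-members is guessed to be a member. The sketch below assumes per-record losses have already been obtained from the model by some means. The function names, the `z` cutoff, and every number are synthetic illustrations, not measurements from any real system.

```python
from statistics import mean, stdev


def calibrate_threshold(non_member_losses: list[float], z: float = 2.0) -> float:
    """Set the cutoff z sample standard deviations below the mean loss
    observed on records known to be outside the training set."""
    return mean(non_member_losses) - z * stdev(non_member_losses)


def likely_member(candidate_loss: float, threshold: float) -> bool:
    """Guess 'was in the training set' when the loss is unusually low."""
    return candidate_loss < threshold


if __name__ == "__main__":
    # Synthetic values for illustration only.
    reference_losses = [2.9, 3.1, 3.4, 2.8, 3.2, 3.0]
    threshold = calibrate_threshold(reference_losses)
    print(likely_member(0.4, threshold))  # loss far below the reference range
    print(likely_member(3.1, threshold))  # loss inside the reference range
```

The test needs nothing more than the ability to score text against the model, which is why deleting the source dataset after fine-tuning does not remove the exposure.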

The Controls, Mapped Honestly

Prompt-boundary DLP works against chat log leakage and nothing else. It doesn't address documents already embedded in a vector store, and it doesn't reach fine-tuning datasets. Its effectiveness against chat logs depends on classifier quality; regex-based approaches miss contextual sensitivity, and ML-based classifiers add latency and introduce their own model risk.

Contractual no-training clauses address the fine-tuning pathway for hosted models, but only on the vendor's side. The vendor commits in writing not to use your data to train or improve its systems; the clause does not reach a model you fine-tune yourself on sensitive data. The commitment is real and enforceable. Verifying compliance is another matter: the customer is trusting the vendor's internal audit controls and whatever contractual remedies exist if the commitment is breached.

Tenant isolation addresses cross-contamination — your data reaching another tenant's context through shared inference infrastructure or a shared vector store. The control is architecturally sound when implemented correctly. Correct implementation means separate vector stores or strict cryptographic namespace separation, not just logical partitioning that a misconfiguration can collapse.

Retention windows address chat log persistence. A 30-day retention window is a real control. Embeddings, which persist as long as the vector store exists, fall outside it. Fine-tuning contributions have no retention window in any meaningful sense.
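The mapping in the four paragraphs above can be written down as a small lookup, which makes the gaps easy to compute for any given set of controls. This is a sketch: the identifiers are made up for illustration, and "covered" is deliberately coarse, since a control that touches a pathway may address only part of it (tenant isolation limits cross-tenant exposure of embeddings but says nothing about how long they persist).

```python
PATHWAYS = ("chat_logs", "embeddings", "fine_tuning")

# Coverage as described in the paragraphs above; names are illustrative.
CONTROL_COVERAGE = {
    "prompt_boundary_dlp": {"chat_logs"},
    "no_training_clause": {"fine_tuning"},
    "tenant_isolation": {"embeddings"},
    "retention_window": {"chat_logs"},
}


def uncovered_pathways(controls_in_place: list[str]) -> list[str]:
    """Return the pathways that none of the listed controls touches."""
    covered: set[str] = set()
    for control in controls_in_place:
        covered |= CONTROL_COVERAGE[control]
    return [pathway for pathway in PATHWAYS if pathway not in covered]


if __name__ == "__main__":
    # A buyer relying on DLP plus a 30-day retention window:
    print(uncovered_pathways(["prompt_boundary_dlp", "retention_window"]))
```

For a buyer with only DLP and a retention window in place, the function returns the embeddings and fine-tuning pathways, the two that those controls never touch.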

The ZDR Gap

Zero-data-retention claims circulate in two forms that sound identical and mean different things.

A contractual ZDR commitment means the vendor has agreed in writing not to retain your data beyond the session. This is enforceable through contract. In FedRAMP-authorized deployments, it sits inside an audit framework with defined controls and third-party assessment. For non-FedRAMP SaaS, the customer is relying on the vendor's word and whatever audit rights the contract provides.

Technical ZDR means the architecture is designed so that data cannot be retained even if the vendor wanted to. This requires stateless inference infrastructure, no logging at the infrastructure layer, and architectural guarantees that survive a misconfiguration. Very few hosted LLM providers can make this claim credibly. The deployments that can are typically on-premises or private cloud, where the customer controls the infrastructure.

For a regulated environment, the distinction carries weight because the NIST SP 800-53 Rev 5 controls covering data protection, SC-28 (Protection of Information at Rest) and SI-12 (Information Management and Retention), require technical controls, not policy commitments. An auditor asking how data is protected at rest needs a technical answer. A contract exhibit is supporting evidence, not a primary control.

When a buyer in a regulated environment asks about zero-data-retention, the question worth asking back is: which pathway does that commitment cover, and what's the verification mechanism? That answer will tell you whether you're looking at a contractual commitment, a technical guarantee, or something in between.

Okta Concept Mapping

The closest IDAM analog is session data handling — specifically, whether a session token and its associated claims are truly ephemeral or logged somewhere in the authentication infrastructure. The analogy holds in that both involve interaction data that should not persist beyond its intended scope. The break point: in IDAM, session data is structured and bounded. It has a defined schema, a known size, and deletion is a database operation with verifiable semantics. In LLM infrastructure, "session data" includes embeddings, intermediate artifacts, and potentially fine-tuning contributions — none of which have clean deletion primitives. You can delete the log entry. You cannot delete the model's exposure to the data that generated it.