The Disclosure Stack: Model Cards, System Cards, and the Audit Trail Gap SOC 2 Can't Fill

By Carey Whitten, May 5, 2026

What These Documents Actually Are

A model card documents a specific trained model: its architecture, the data it was trained on, the evaluation methodology used to assess it, its intended use cases, its known limitations, and its performance across demographic subgroups where relevant. Think of it as the controlled document for the model artifact itself. Researchers at Google formalized the concept in the 2019 paper "Model Cards for Model Reporting"; the format has since been adopted widely across open-source repositories and is increasingly referenced in procurement requirements.

A system card documents a deployed system. It covers not the model in isolation but the full pipeline: how the model is integrated, what guardrails and filters are applied, what human oversight mechanisms exist, and how the system behaves in its operational context. OpenAI's GPT-4 system card is a widely cited example. A system card answers the question a model card can't: what happens when this model is wrapped in an application and exposed to users?

A datasheet for datasets (also written "data sheet") documents the training or evaluation dataset itself: where the data came from, how it was collected, what was filtered and why, what known biases were characterized, and what uses the dataset is and isn't appropriate for. Timnit Gebru and colleagues introduced this framing in 2018, explicitly borrowing the concept from the electronics industry, where every component ships with a datasheet describing its operating characteristics.

Three layers: dataset, model, system. Each answers a different question. Regulators are beginning to require all three, or their functional equivalents.

Where the Requirements Are Coming From

The EU AI Act's documentation requirements for high-risk AI systems (technical documentation under Article 11 and Annex IV, with related data governance, record-keeping, and quality management obligations in Articles 10, 12, and 17) specify that providers must document training data characteristics, evaluation methodology, performance metrics, and post-market monitoring procedures. The Act doesn't mandate model cards by name, but a compliant technical documentation package looks structurally very similar to a model card plus a system card plus a data sheet.

The NIST AI Risk Management Framework (AI RMF) is voluntary, and it organizes its expectations around four functions: Govern, Map, Measure, Manage. The Measure function calls for documented evaluation methodology: how the system was tested, against what benchmarks, with what results. The Manage function calls for post-deployment monitoring with documented procedures. The RMF doesn't prescribe artifact formats, but the disclosure artifact genre is the natural implementation response.

ISO/IEC 42001, the AI management system standard, creates documentation obligations through its performance evaluation clauses. Clause 9 requires organizations to evaluate AI system performance against defined criteria and retain documented evidence of results. Clause 10 requires documented evidence of corrective actions when performance deviates. This is quality management system logic applied to AI, and it demands the same provenance, evaluation, and monitoring documentation the other frameworks require.

Three frameworks, independently developed, are pointing at the same evidentiary gap: organizations deploying AI systems need documented evidence of what the system was trained on, how it was evaluated, and how it's performing after deployment.

What Enterprise-Grade Looks Like

An enterprise-grade AI audit trail has five layers.

Dataset provenance: where training data originated, what was included or excluded, what bias characterization was performed.

Evaluation methodology: what benchmarks were used, what held-out test sets, what red-teaming or adversarial testing, what fairness metrics across relevant subgroups.

Model documentation: version, architecture, intended use, known failure modes.

Deployment configuration: system prompts, guardrails, human-in-the-loop mechanisms, access controls.

Post-deployment monitoring: drift detection procedures, output sampling methodology, incident logging, performance comparison against evaluated baseline.

In a public sector procurement context, a CISO doing due diligence on an AI system for benefits determination or case management will eventually ask for evidence across all five layers. "Enterprise-ready" as a vendor posture needs to be substantiated by documentation that maps to this structure. A vendor who can produce a model card and a system card but has no documented post-deployment monitoring procedure has a gap that will surface in the security review.
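One way to make the five-layer structure concrete is to treat it as a checklist that a reviewer fills in as the vendor supplies documents. The sketch below is illustrative only: the `AuditTrail` class, the layer identifiers, the vendor name, and the file names are all hypothetical, and assigning a single model card to two layers is an assumption about what that card happens to contain.

```python
from dataclasses import dataclass, field

# Layer identifiers mirror the five layers described above.
# The names themselves are illustrative, not taken from any standard.
LAYERS = (
    "dataset_provenance",
    "evaluation_methodology",
    "model_documentation",
    "deployment_configuration",
    "post_deployment_monitoring",
)


@dataclass
class AuditTrail:
    """Maps each layer to the evidence documents a vendor has supplied."""

    vendor: str
    evidence: dict[str, list[str]] = field(default_factory=dict)

    def add(self, layer: str, document: str) -> None:
        """Attach a document to a layer, rejecting unknown layers."""
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        self.evidence.setdefault(layer, []).append(document)

    def gaps(self) -> list[str]:
        """Return the layers with no supporting evidence, in layer order."""
        return [layer for layer in LAYERS if not self.evidence.get(layer)]


# Hypothetical vendor with a data sheet, a model card, and a system card,
# but no documented post-deployment monitoring procedure.
trail = AuditTrail(vendor="ExampleVendor")
trail.add("dataset_provenance", "data_sheet.pdf")
trail.add("model_documentation", "model_card.pdf")
trail.add("evaluation_methodology", "model_card.pdf")
trail.add("deployment_configuration", "system_card.pdf")

print(trail.gaps())
```

A check like this only shows that a document exists for each layer. Whether the document actually contains adequate evidence still requires a human review.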

Okta Concept Mapping

The closest IDAM analogue to the AI audit trail is a SOC 2 Type II report. Both are structured evidence packages designed to give a relying party confidence that a system behaves as documented. SOC 2 works well here: it establishes that controls exist, that they operated over a defined period, and that an independent auditor reviewed the evidence.

The analogy breaks at the decision chain. SOC 2 was designed for deterministic systems. If the controls are in place, the system behaves as specified — that assumption is baked into the audit logic. AI systems break it. The same model, with identical controls, can produce different outputs for similar inputs. A SOC 2 report can confirm that the model was deployed behind an authenticated API with appropriate access logging. It cannot tell you why the model produced a particular output, whether that output was consistent with the model's evaluated behavior, or whether the model's behavior has drifted from its documented baseline since deployment. Regulators are beginning to require exactly that decision-chain evidence. SOC 2 doesn't carry it, and no amount of SOC 2 scope expansion will make it do so. The disclosure artifact genre exists precisely because the existing audit vocabulary doesn't have a word for this.
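To illustrate what decision-chain evidence could look like in practice, the sketch below pairs a per-decision log record with a simple comparison of sampled post-deployment outcomes against an evaluated baseline. Everything here is an assumption made for illustration: the `DecisionRecord` fields, the `drift_report` helper, the baseline rate, and the fixed tolerance are hypothetical. A production monitoring procedure would more likely use a statistical test and a documented sampling methodology than a fixed threshold.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionRecord:
    """One logged decision, tied back to the documented baseline."""

    model_version: str
    input_sha256: str  # hash of the input, so raw personal data stays out of the log
    output: str
    baseline_id: str   # hypothetical identifier of the evaluation the model shipped with


def record_decision(model_version: str, raw_input: str, output: str,
                    baseline_id: str) -> DecisionRecord:
    """Build a log record for a single model output."""
    digest = hashlib.sha256(raw_input.encode("utf-8")).hexdigest()
    return DecisionRecord(model_version, digest, output, baseline_id)


def drift_report(baseline_rate: float, sampled_outcomes: list[bool],
                 tolerance: float) -> dict:
    """Compare the share of correct sampled outputs with the evaluated baseline.

    `sampled_outcomes` holds True for each sampled output a reviewer judged
    correct. The fixed `tolerance` is an illustrative assumption.
    """
    if not sampled_outcomes:
        raise ValueError("no sampled outcomes to compare")
    observed = sum(sampled_outcomes) / len(sampled_outcomes)
    delta = observed - baseline_rate
    return {"observed": observed, "delta": delta, "drifted": abs(delta) > tolerance}


rec = record_decision("model-2026-04", "applicant case text", "eligible", "eval-baseline-01")
print(json.dumps(asdict(rec), indent=2))

# 16 of 20 sampled outputs judged correct, against an evaluated baseline of 0.90.
report = drift_report(0.90, [True] * 16 + [False] * 4, tolerance=0.05)
print(report["drifted"])
```

Neither piece explains why the model produced a given output. What they provide is a record linking each decision to a model version and a baseline, plus a repeatable check for divergence from that baseline. A SOC 2 report does not carry either kind of evidence.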

When You'll Need This

The question is already appearing in federal RFIs. Agencies issuing solicitations for AI-assisted case processing, document review, or decision support are beginning to include documentation requirements that map directly to model card and system card content — sometimes by name, more often by description. When a buyer asks whether a vendor's AI system has been evaluated for fairness across demographic groups, or whether there's a documented procedure for detecting post-deployment performance drift, they're asking for disclosure artifact content. A seller who recognizes the question and knows what evidence should exist — and what the gap looks like when it doesn't — is in a different conversation than one who reaches for the SOC 2 report.

The audit trail gap surfaces when a technically sophisticated buyer reads the vendor's documentation package and finds that it describes controls but not decisions. That's the conversation to be ready for.
