Three disclosure artifacts have become the vocabulary of AI accountability: model cards, system cards, and data sheets. Knowing what each one actually documents — and where all three fall short — is what separates a vendor that looks compliant from one that can demonstrate it.
Model cards document a machine learning model as an artifact. They specify the model's intended use cases, its evaluation methodology, its performance metrics across relevant subgroups, and its known limitations. A model card is produced by the model developer and travels with the model. The concept was formalized in a 2019 paper by Mitchell et al. at Google, and the format is now widely adopted; its contents overlap substantially with the technical documentation the EU AI Act requires under Annex IV.
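As a rough illustration of what "performance metrics across relevant subgroups" means in practice, a model card can be treated as a structured record. The sketch below is not a standard schema: the class, field names, and accuracy figures are all hypothetical, chosen only to show the kind of check a reviewer might run.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCard:
    """Hypothetical minimal model card; field names are illustrative only."""
    model_name: str
    intended_uses: list[str]
    evaluation_method: str
    subgroup_accuracy: dict[str, float]
    known_limitations: list[str] = field(default_factory=list)

    def largest_subgroup_gap(self) -> float:
        """Difference between the best and worst subgroup accuracy."""
        scores = self.subgroup_accuracy.values()
        return max(scores) - min(scores)


# Made-up values for illustration, not results from any real model.
card = ModelCard(
    model_name="example-classifier",
    intended_uses=["triage of support tickets"],
    evaluation_method="held-out test set, stratified by subgroup",
    subgroup_accuracy={"group_a": 0.75, "group_b": 0.5},
    known_limitations=["not evaluated on non-English text"],
)
print(card.largest_subgroup_gap())  # 0.25
```

A single aggregate accuracy figure would hide the gap this record makes visible, which is the reason the subgroup breakdown belongs in the card.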
System cards document the deployed system: not the model in isolation but the model as integrated into a specific product or workflow. They describe what guardrails the operator has applied, which use cases are enabled or restricted, what human oversight mechanisms exist, and how the system's behavior has been tested in the deployment context. OpenAI published a system card for GPT-4, and Meta has published system cards for several of its AI systems; the format is now standard practice for foundation model operators. System cards are the operator's documentation layer, distinct from what the model developer provides.
Data sheets for datasets document provenance. Where did the training data come from? How was it collected, filtered, and labeled? What known biases exist in the distribution? What uses is it appropriate for? The format originates in a 2018 paper by Gebru et al. and addresses a gap that model cards don't: a model card can document evaluation results, but without dataset documentation, you cannot evaluate whether the evaluation was meaningful.
How the Frameworks Converge
The EU AI Act, NIST AI RMF, and ISO/IEC 42001 use different vocabulary and carry different enforcement weight. They are converging on the same evidentiary requirements.
The EU AI Act's Annex IV specifies technical documentation for high-risk AI systems: general system description, design specifications, training and validation data descriptions, monitoring and control procedures, and post-market monitoring plans. NIST's AI RMF Playbook, through its MEASURE and GOVERN functions, calls for documented evaluation methodology, performance monitoring against defined metrics, and organizational policies governing AI deployment decisions. ISO/IEC 42001, published in December 2023, establishes an AI management system standard with explicit controls for impact assessment, lifecycle documentation, and performance evaluation — Annex A controls A.6 and A.7 map directly onto what the EU Act requires in Annex IV.
The convergence is not coincidental. All three frameworks are responding to the same regulator question: if something goes wrong with this AI system, can you show us what it was supposed to do, how you verified it was doing that, and what you did when it wasn't?
An enterprise-grade AI audit trail answers that question across four layers. First, dataset documentation: what data trained and validated the model, with provenance and known limitations. Second, model documentation: architecture, training methodology, evaluation results on defined benchmarks, performance across demographic subgroups. Third, deployment documentation: how the model is integrated, what the operator has modified or restricted, what use cases are in scope. Fourth, operational monitoring: drift detection logs, performance degradation records, incident reports, and records of human review decisions. Each layer depends on the one before it. Missing any one of them leaves a gap that a regulator — or a plaintiff's attorney — will find.
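The four-layer dependency described above can be sketched as a simple completeness check. This is a minimal sketch under stated assumptions: the `AuditTrail` class, the layer names, and the document file names are hypothetical, not part of any framework or standard.

```python
from dataclasses import dataclass, field
from typing import Optional

# Layer names follow the four layers described above; the order encodes
# the dependency (each layer relies on the ones before it).
LAYERS = ("dataset", "model", "deployment", "monitoring")


@dataclass
class AuditTrail:
    """Hypothetical container mapping each layer to its evidence documents."""
    evidence: dict[str, list[str]] = field(default_factory=dict)

    def add(self, layer: str, document: str) -> None:
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        self.evidence.setdefault(layer, []).append(document)

    def gaps(self) -> list[str]:
        """Return the layers with no evidence, in dependency order."""
        return [layer for layer in LAYERS if not self.evidence.get(layer)]

    def first_unsupported(self) -> Optional[str]:
        """Return the earliest documented layer that rests on a missing one."""
        missing_seen = False
        for layer in LAYERS:
            if not self.evidence.get(layer):
                missing_seen = True
            elif missing_seen:
                return layer
        return None


trail = AuditTrail()
trail.add("model", "model_card_v3.md")            # illustrative file names
trail.add("deployment", "system_card_2024Q2.md")
print(trail.gaps())               # ['dataset', 'monitoring']
print(trail.first_unsupported())  # model
```

In this example the vendor has a model card and a system card, but the model card's evaluation results rest on undocumented data, and nothing shows how the deployed system has behaved since launch.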
The Deterministic System Problem
Most current AI audit practices break down at a specific point, and that breakdown explains why a vendor's existing compliance posture is probably not sufficient even if the vendor has model cards and a SOC 2 report.
SOC 2 was designed for deterministic systems. A system is deterministic when a given input reliably produces a given output and the decision chain is traceable. Access was requested, access was granted or denied, the event was logged. An auditor can verify that the control operated as designed because the evidence is complete and the logic is fixed.
AI systems are not deterministic in this sense. The same prompt sent to the same model can produce different outputs depending on sampling temperature, conversation history, and updates to the underlying model version. More critically, the internal reasoning that produced the output is not logged. You can capture inputs and outputs. You cannot capture the decision chain between them.
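The effect of temperature on repeatability can be illustrated with a toy sampler. This is a simplified sketch of softmax sampling over made-up next-token scores, not the decoding code of any particular model; the function name, token labels, and logit values are assumptions for illustration.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Sample one token from softmax(logits / temperature)."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: always pick the top logit.
        return max(logits, key=logits.get)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Made-up scores for the next token after an identical prompt.
logits = {"approve": 2.0, "deny": 1.5, "escalate": 1.0}
rng = random.Random()  # unseeded, as in a typical production call

greedy = {sample_token(logits, 0.0, rng) for _ in range(200)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(200)}
print(greedy)   # always {'approve'}
print(sampled)  # in practice contains more than one distinct token
```

Logging the input and the output of each call captures which token was drawn. It does not capture why the scores were what they were, which is the part an auditor of a deterministic control would expect to see.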
Regulators are beginning to require exactly what current audit practices cannot provide: documented evidence of how the system made specific decisions, not just what decisions it made. The EU AI Act's post-market monitoring requirements, NIST's MEASURE 2.5 practices, and ISO 42001's Clause 9 performance evaluation controls all point toward ongoing behavioral attestation — continuous evidence that the system's behavior matches its documented intent across the full distribution of inputs it actually encounters, rather than a point-in-time audit of controls.
Most vendors cannot produce this evidence today. The logging infrastructure doesn't exist, the evaluation methodology hasn't been defined, and the baseline against which drift would be measured was never established. A model card and a SOC 2 report together are not a substitute. When a federal procurement officer or a CISO asks whether a vendor is "enterprise-ready" on AI compliance, this is the specific question worth asking: does the documentation capture behavioral evidence from the system as deployed, over time, against a defined standard of appropriate behavior? Most current answers are incomplete, and the vendors asking you to accept them probably know it.
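A minimal sketch of what behavioral evidence "over time, against a defined standard" could look like follows. Everything here is an assumption for illustration: the record fields, the pass/fail verdict (which presumes an evaluation methodology has already been defined), the baseline pass rate, and the tolerance are hypothetical and not drawn from any of the frameworks above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib
import json


@dataclass(frozen=True)
class BehaviorRecord:
    """One logged interaction plus the verdict of a defined evaluation."""
    timestamp: str
    model_version: str
    input_sha256: str
    output_sha256: str
    passed: bool


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_record(model_version: str, prompt: str, output: str, passed: bool) -> BehaviorRecord:
    """Build a record that stores hashes rather than raw text."""
    return BehaviorRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        model_version=model_version,
        input_sha256=_sha256(prompt),
        output_sha256=_sha256(output),
        passed=passed,
    )


def drift_report(records, baseline_pass_rate: float, tolerance: float) -> dict:
    """Compare the observed pass rate in a window against the baseline."""
    if not records:
        raise ValueError("no records in window")
    observed = sum(r.passed for r in records) / len(records)
    return {
        "n": len(records),
        "observed_pass_rate": observed,
        "baseline_pass_rate": baseline_pass_rate,
        "drifted": (baseline_pass_rate - observed) > tolerance,
    }


# Illustrative window: 20 interactions, the first 3 fail the evaluation.
window = [
    make_record("model-v1", f"prompt {i}", f"output {i}", passed=(i >= 3))
    for i in range(20)
]
report = drift_report(window, baseline_pass_rate=0.95, tolerance=0.05)
print(json.dumps(report, indent=2))  # observed 0.85, so "drifted" is true
```

The sketch only works if the three missing pieces named above exist: a log of interactions, an evaluation that yields the `passed` verdict, and a baseline established before deployment. Hashing inputs and outputs is one possible design choice for keeping sensitive text out of the audit log while still letting a reviewer match records to retained originals.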
Okta Concept Mapping
The closest IDAM analogy is SOC 2. A SOC 2 audit and an AI audit both ask a third party to verify that a system behaves as claimed, using documented controls and periodic evidence review. The analogy holds for the governance structure: policies exist, controls are defined, audits happen on a schedule. The evidentiary model is where it stops holding. SOC 2 asks whether a control ran and gets a binary answer: the log either shows the access event or it doesn't. An AI audit asks whether the model behaved appropriately and cannot get a binary answer, because "appropriate" is a function of the input distribution, the deployment context, and the evaluation methodology, none of which SOC 2's framework was designed to specify or verify. The gap is architectural, not procedural.

