AI Copyright Exposure Is Two Questions, Not One

By Carey Whitten, May 5, 2026

The legal risk from AI and copyright isn't a single problem. It's two structurally different problems that happen to share a vocabulary, and conflating them in a procurement conversation — or worse, in a contract review — will cost you credibility with the legal and compliance teams who are increasingly in the room.

The precise exposure: AI systems trained on copyrighted material may have infringed those copyrights either (a) by using the material as training data, or (b) by acquiring that material through means that violated terms of service, access controls, or licensing agreements. These are distinct legal theories. They have different current outcomes in litigation. They are covered differently, or not at all, by vendor indemnification clauses. And they require different answers when a Chief AI Officer (CAIO) asks about them.

The Training Question

Fair use, in U.S. copyright law, is an affirmative defense that permits unauthorized use of copyrighted material under certain conditions. Courts evaluate it through a four-factor test, with transformative use carrying significant weight. The argument that training a model on copyrighted text transforms that text into statistical weights — rather than reproducing it — has generally fared well in early judicial treatment.

The New York Times Co. v. OpenAI, Inc. (S.D.N.Y., filed December 2023; active litigation) is the highest-profile test of this theory. The Times alleges that OpenAI's models reproduced copyrighted articles verbatim in their outputs, not merely that the articles were used in training. That distinction matters: the transformative-use argument is stronger when the model doesn't regurgitate source material, weaker when it demonstrably does. As of this writing, the case has survived early dismissal motions and is proceeding toward discovery. There has been no ruling on the merits.

Andersen v. Stability AI Ltd. (N.D. Cal.; active, class action) raised similar claims on behalf of visual artists. Several claims were dismissed at the pleading stage; others survived and are on appeal in the Ninth Circuit. The appellate outcome will matter for how courts treat the transformative-use argument in image generation specifically.

Courts have not definitively ruled that training is fair use. They have declined to rule that it isn't, at the motion-to-dismiss stage. That's meaningfully different from settled law, and anyone representing it as settled is ahead of the evidence.

The Source-Acquisition Question

This is where significant money has already moved.

Several music publishers reached settlements with AI audio companies in 2024 and 2025 over training data obtained by scraping platforms in violation of terms of service. The legal theory wasn't primarily about what the model did with the music — it was about how the music was obtained. Breach of contract, circumvention of access controls, and in some cases Computer Fraud and Abuse Act claims ran alongside the copyright claims.

Getty Images (US), Inc. v. Stability AI, Ltd. (D. Del., filed February 2023; active litigation) combines both theories: unauthorized scraping of Getty's licensed image library and reproduction of Getty watermarks in generated outputs. The watermark claim is particularly notable because it doesn't require the court to resolve the training-as-fair-use question at all — it's a straightforward reproduction claim.

The source-acquisition theory is structurally more dangerous for enterprise buyers than the training theory, for a simple reason: it doesn't depend on how the model uses the data. If the vendor obtained training data improperly, the exposure exists regardless of what the model produces. And unlike a fair-use defense, there's no four-factor test that saves you. The question is whether the acquisition was authorized, full stop.

What Indemnification Actually Covers

Major AI vendors — OpenAI, Google (Vertex AI), Microsoft (Azure OpenAI Service) — have published IP indemnification provisions for enterprise customers. These are real commitments, not marketing language, but they have structural limits that matter in procurement conversations.

The typical coverage: the vendor will defend and indemnify the customer against third-party claims that the vendor's model, as delivered, infringes a third party's intellectual property. This covers the training question, to the extent the vendor is willing to stand behind it.

The typical exclusions: claims arising from the customer's own input data; claims arising from customer modifications to model outputs; and claims arising from use cases outside the permitted scope defined in the agreement. There is also a fourth, critical exclusion: claims that the vendor's training data was improperly acquired. That one is rarely stated explicitly, but it follows from the structure of the indemnification, which covers the vendor's product, not the vendor's data-acquisition practices.

Read the indemnification clause against the source-acquisition question and the coverage gap becomes visible. If a rights holder sues because a vendor scraped their content without authorization, the vendor's indemnification typically covers the customer's downstream liability only to the extent the claim is about the model's output. Claims about the underlying data acquisition are the vendor's problem until they become the customer's problem through contract structure — and whether they do depends on how the enterprise agreement is drafted.
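For readers who think in access-policy terms, the coverage logic described above can be sketched as a small triage function. This is an illustration only, under stated assumptions: the category names (`ClaimBasis`) and the `triage` function are hypothetical, they encode this article's summary of typical coverage and exclusions rather than the terms of any specific vendor agreement, and the result labels are conversation starters, not legal conclusions.

```python
from enum import Enum, auto


class ClaimBasis(Enum):
    """Hypothetical categories for what a third-party claim is about."""
    MODEL_AS_DELIVERED = auto()     # the training question
    CUSTOMER_INPUT = auto()         # customer's own input data
    CUSTOMER_MODIFICATION = auto()  # customer changes to model outputs
    OUT_OF_SCOPE_USE = auto()       # use outside the agreement's permitted scope
    DATA_ACQUISITION = auto()       # the source-acquisition question


# Exclusions as summarized in this article; an actual agreement may differ.
EXPLICIT_EXCLUSIONS = {
    ClaimBasis.CUSTOMER_INPUT,
    ClaimBasis.CUSTOMER_MODIFICATION,
    ClaimBasis.OUT_OF_SCOPE_USE,
}


def triage(basis: ClaimBasis) -> str:
    """Return a rough first-pass label for how indemnification may apply."""
    if basis is ClaimBasis.MODEL_AS_DELIVERED:
        return "likely covered"
    if basis in EXPLICIT_EXCLUSIONS:
        return "likely excluded"
    # Source acquisition: rarely addressed explicitly, so it falls through
    # to the gap that depends on how the enterprise agreement is drafted.
    return "coverage gap: review agreement"


if __name__ == "__main__":
    for basis in ClaimBasis:
        print(f"{basis.name}: {triage(basis)}")
```

The design point is that source acquisition is not handled by an explicit rule at all. It lands in the fall-through branch, which mirrors how the exclusion arises from the structure of the clause rather than from its text.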

When a CAIO asks whether the agency is exposed to copyright litigation from its AI vendor relationship, the right answer distinguishes these two tracks. On training: courts are treating it as likely fair use in current cases, but no appellate court has decided the question, and the litigation is active. On source acquisition: settlements have occurred, the legal theory doesn't require a fair-use analysis, and the vendor's indemnification may not cover it. Two tracks, two answers.

Okta Concept Mapping

The closest IDAM analog to vendor indemnification is license scope: a vendor grants you rights to use a resource and represents that the grant is valid. If the grant turns out to be defective, the vendor is on the hook. For the training question, that mapping holds — the vendor is representing that their model is properly licensed for your use. The source-acquisition question is where it stops working. In IDAM, a defective access grant gets revoked and re-provisioned; the exposure is bounded in time. In AI copyright, you can't un-train a model. The potentially infringing data is in the weights. Revoking the vendor relationship doesn't undo the training, and indemnification that covers your use of the model doesn't necessarily cover the exposure that was baked in before you signed the contract.

Litigation status and indemnification terms verified against publicly available sources as of May 2026. This lesson is flagged for triggered accuracy review upon significant court outcomes in referenced cases. Nothing here constitutes legal advice.