The AI copyright problem is actually two problems wearing the same name. Conflating them, which most enterprise conversations do, produces a kind of confident confusion: people who sound informed but are answering the wrong question.
The Training Question
When an AI model is trained on copyrighted material, the legal question is whether that use constitutes infringement or qualifies as fair use under 17 U.S.C. § 107. Courts analyzing this question have generally focused on the transformative use doctrine: does the use create something new with a different purpose and character than the original, or does it substitute for the original in the market?
In decided cases to date, courts have leaned toward finding model training transformative. The reasoning, roughly: a model trained on text doesn't reproduce that text; it extracts statistical patterns. The output isn't a copy of the input. In Authors Guild v. Google (2d Cir. 2015) — decided before the current wave of generative AI litigation but frequently cited as precedent — the court found that digitizing books to enable search was transformative even though the underlying copies were made in full. Subsequent decisions in the generative AI context have applied similar logic, though the factual record is more contested. [⚠️ Timeliness flag: All referenced case outcomes are live and subject to change. Accuracy review required against current docket status as of publication date.]
"Generally likely fair use" is not the same as settled law. The New York Times v. OpenAI litigation, filed in late 2023, specifically challenges whether verbatim reproduction of articles in training outputs breaks the transformative use argument — a factual question that could produce a different outcome than cases where reproduction wasn't at issue. The training question is open. It's just less open than most enterprise buyers assume.
The Source-Acquisition Question
This is where significant money has already moved, and it's legally distinct from the training question.
The source-acquisition question asks about vendor conduct: how they obtained the data, not what the model subsequently did with it. Scraping copyrighted content without a license, ignoring robots.txt directives, or using datasets assembled through terms-of-service violations — these are conduct questions that don't depend on transformative use doctrine at all. A vendor can win the training question and still have liability exposure from how it built its dataset.
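To make the robots.txt point concrete, here is a minimal sketch of what honoring a site's directives looks like mechanically, using Python's standard-library urllib.robotparser. The crawler name ExampleTrainingBot, the second crawler name, the example.com URLs, and the robots.txt contents are all hypothetical, chosen only for illustration. Passing this check says nothing about whether the content is licensed; it only shows that the directive is machine-readable and cheap to respect.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is barred from the whole site,
# every other crawler is barred only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_fetch_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("ExampleTrainingBot", "https://example.com/articles/1"),
        ("SomeOtherBot", "https://example.com/articles/1"),
        ("SomeOtherBot", "https://example.com/private/report"),
    ]
    for agent, url in checks:
        print(agent, url, is_fetch_allowed(ROBOTS_TXT, agent, url))
```

A crawler that skips a check of this kind is making a conduct choice, which is why the source-acquisition question can be asked independently of what the trained model does.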
Getty Images' litigation against Stability AI (ongoing as of this writing) centers substantially on this point: whether Stability AI had any right to use Getty's licensed image library as training input, independent of whether image generation is transformative. [⚠️ Timeliness flag: Verify current status before publication.] Several smaller settlements in the 2024–2025 period involved similar fact patterns — vendors paying to resolve claims about dataset assembly rather than model outputs.
For enterprise buyers, this distinction matters because your exposure profile differs depending on which question is in play. If a court eventually rules that training is not fair use, every model trained on unlicensed data has a problem. If the source-acquisition question generates liability, the vendors who built clean datasets are differentiated from those who didn't — and your vendor's indemnification clause becomes the operative document.
What Indemnification Actually Covers
The major providers have published indemnification programs with names that sound comprehensive. Read the scope terms carefully.
OpenAI's Copyright Shield, Google's generative AI indemnification, Microsoft's Copilot Copyright Commitment, and Anthropic's comparable provisions share a common structure: they cover third-party copyright claims arising from outputs generated by the model when used as directed. The vendor defends you if someone claims your AI-generated content infringes their copyright.
What they generally exclude: claims arising from inputs you provided (if you fed copyrighted material into a prompt), claims in regulated industries where the terms carve out specific use cases, claims arising from willful infringement (if you instructed the model to reproduce a specific copyrighted work), and — critically — claims about the training data itself. The indemnification runs downstream from the model, not upstream into how the model was built.
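The coverage and exclusions described above can be sketched as a small decision function. This is an illustration of the general structure only, not any vendor's actual terms: the names ClaimSource, Claim, and likely_covered are hypothetical, and the order of the checks is an assumption made for readability. Real coverage depends on the contract language in force at the time of the claim.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum, auto


class ClaimSource(Enum):
    """Where the alleged infringement originates (hypothetical categories)."""
    MODEL_OUTPUT = auto()
    CUSTOMER_INPUT = auto()
    TRAINING_DATA = auto()


@dataclass(frozen=True)
class Claim:
    source: ClaimSource
    used_as_directed: bool = True      # customer followed the vendor's usage terms
    carved_out_use_case: bool = False  # e.g. a use case the terms exclude
    willful: bool = False              # customer asked for a specific protected work


def likely_covered(claim: Claim) -> tuple[bool, str]:
    """Rough triage: does a typical output indemnity plausibly respond?"""
    if claim.source is ClaimSource.TRAINING_DATA:
        return False, "upstream of the model: training-data claims are generally excluded"
    if claim.source is ClaimSource.CUSTOMER_INPUT:
        return False, "claims arising from customer-provided inputs are generally excluded"
    if claim.willful:
        return False, "willful infringement is generally excluded"
    if claim.carved_out_use_case:
        return False, "use case is carved out of the terms"
    if not claim.used_as_directed:
        return False, "model was not used as directed"
    return True, "third-party claim about an output, used as directed"


if __name__ == "__main__":
    for claim in (
        Claim(ClaimSource.MODEL_OUTPUT),
        Claim(ClaimSource.MODEL_OUTPUT, willful=True),
        Claim(ClaimSource.TRAINING_DATA),
    ):
        print(claim.source.name, likely_covered(claim))
```

The point of writing it this way is that the training-data branch returns before any of the output conditions are even considered, which mirrors how the indemnity runs downstream from the model.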
The gap worth understanding: if a court eventually finds that training on unlicensed data constitutes infringement, and that infringement is embedded in every output the model produces, the output indemnification may not cover the resulting claims. As of this writing, the output-focused commitments described above do not clearly extend to training-data liability, and terms vary by vendor and change over time, so read the current language rather than the program name. That omission tracks the most contested legal question in the space, and the clause language reflects it.
Okta Concept Mapping. The closest identity and access management (IDAM) analogy is federated identity liability. When you accept a SAML assertion from an external IdP, you're downstream of their authentication decision: if a credential on their side was compromised, you inherit the exposure. Vendor indemnification in AI copyright works similarly: you're downstream of the training decision, trusting the vendor's claim that their data acquisition was clean. The analogy holds for understanding the liability structure. It breaks because federation trust is typically governed by an explicit, bilateral federation agreement with defined liability allocation, whereas AI indemnification is usually a unilateral vendor clause in terms of service, changeable on notice, with no negotiated allocation of training-data risk. That asymmetry is the gap the analogy can't cover.
Before the Next Procurement Call
When legal or procurement raises AI copyright concerns, ask which question they're actually asking. Training question or source-acquisition question? The answer determines whether the vendor's indemnification clause is responsive.
For public sector buyers, the source-acquisition question carries additional weight: procurement regulations in some agencies require vendors to represent that their products don't incorporate unlicensed third-party IP. Whether training data qualifies as "incorporated" IP is unsettled, but the representation request is already appearing in RFP language.
Three questions worth putting to a vendor: Does your indemnification cover claims arising from training data, or only from outputs? Have you published your dataset provenance or licensing agreements? Has your legal team issued a written opinion on training-data acquisition?
Most vendors will not answer the third question in writing. That's information too.

