Question one is whether training a model on copyrighted material constitutes infringement. Question two is whether the method used to acquire that training data creates independent legal exposure. Courts have been addressing these questions on separate tracks, and the outcomes have diverged in ways that matter for anyone evaluating vendor indemnification claims.
Both questions remain live. Flag this lesson for review as cases develop.
The Training Question
The fair use analysis for AI training turns on four statutory factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the original. The transformative use argument, that training a model to recognize patterns is categorically different from reproducing content, has fared reasonably well in rulings through early 2026.
In Andersen v. Stability AI (N.D. Cal.), the court dismissed several claims while allowing others to proceed on amended complaints, signaling that training-as-infringement theories face a high bar but aren't foreclosed. The New York Times v. Microsoft and OpenAI litigation (S.D.N.Y.) is the most significant live case; it has not been resolved as of this writing and involves claims that go beyond pure training to include memorization and verbatim reproduction of content in outputs. Courts have been more receptive to fair use arguments about the training process itself than about outputs that reproduce protected expression. That distinction carries weight in how the remaining litigation is likely to develop.
Courts have generally treated training as transformative. This is a trend in early rulings, not settled law. The Supreme Court has not weighed in. The Times case could shift the analysis materially depending on how the memorization claims are resolved.
The Source-Acquisition Question
This is where the money has already moved, and the legal theory is entirely different.
Getty Images' litigation against Stability AI, which resulted in a significant settlement in late 2025, was not primarily a fair use case. The core claims involved removal of copyright management information (the watermarks) under DMCA Section 1202 and breach of terms of service. These theories don't require a court to decide whether training is transformative. They require a court to decide whether the defendant had the right to access and use the content in the first place.
The legal exposure here is closer to unauthorized access and contract breach than to copyright infringement in the traditional sense. The fair use defense, the one vendors are implicitly relying on when they say training is transformative, doesn't reach acquisition claims. A vendor can win the training question and still face liability on the acquisition question.
If your vendor trained on licensed data, acquired through agreements with rights holders, the acquisition risk is theirs. If your fine-tuning pipeline ingests scraped web content, third-party datasets, or any data acquired outside a clear licensing chain, the acquisition risk is yours, regardless of what the vendor's indemnification clause says.
What Indemnification Actually Covers
OpenAI's Copyright Shield, Google's Vertex AI indemnification, Microsoft's Copilot Copyright Commitment, and Adobe's Firefly indemnification all follow the same structural logic: the vendor covers claims arising from the vendor's own training data decisions. None of them extend to the customer's fine-tuning data, customer-provided inputs, or data the customer acquired independently.
Adobe's indemnification is the narrowest in scope but the most defensible in construction: Firefly was trained on licensed Adobe Stock content and public domain material, so the vendor can make a specific claim about provenance. The indemnification covers outputs from that base model. It excludes outputs generated using customer-uploaded reference images, which may themselves carry unresolved rights questions.
Microsoft's Copilot Copyright Commitment covers infringement claims arising from Copilot's outputs when the product is used as intended. The exclusions include customer-provided content, scenarios where the customer has disabled safety filters, and use cases outside the product's documented scope.
OpenAI and Google follow similar patterns. The indemnification is genuine; these vendors are assuming real liability for their training data choices. The gap is that the indemnification scope was defined before courts fully defined the liability surface. A claim that emerges from a theory the vendor didn't anticipate when drafting the clause may fall outside coverage even if it seems related to training.
Read the exclusions before the coverage. The exclusions are where the actual risk allocation lives.
Okta Concept Mapping
Closest IDAM analogy: OAuth scope. When you issue a delegated access token, the scope parameter defines exactly what the token can authorize, and nothing more. Vendor indemnification works the same way: it covers a defined set of claims, and liability outside that scope remains with the customer. The analogy holds for understanding that "covered" is always bounded. Where it breaks: OAuth scope is machine-readable and defined before the token is used. Indemnification scope is contract language written before courts defined the actual liability surface. You may be holding coverage for the wrong operations and not know it until the claim arrives.
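The bounded-scope idea in the analogy can be sketched in a few lines of Python. This is a minimal illustration, not Okta's or any vendor's implementation: the function name `is_authorized` and the scope strings are hypothetical, and the only detail taken from the OAuth 2.0 specification is that scope is a space-delimited list of strings.

```python
def is_authorized(granted_scope: str, required: str) -> bool:
    """Return True only if the required scope was explicitly granted."""
    # OAuth 2.0 expresses scope as a space-delimited list (RFC 6749, section 3.3).
    # Matching is on whole scope values, so "orders.read" does not imply "orders.write".
    return required in granted_scope.split()


# Hypothetical scope string on a delegated access token.
token_scope = "openid profile orders.read"

print(is_authorized(token_scope, "orders.read"))   # True: explicitly granted
print(is_authorized(token_scope, "orders.write"))  # False: never granted, so denied
```

Anything not listed is denied by default, and the check can be run before the token is used. Indemnification language offers no equivalent check, which is where the analogy breaks.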
When You'll Need This
A CAIO at a federal agency asks whether their AI vendor's enterprise agreement covers the agency's fine-tuning pipeline. The vendor says yes, they're indemnified. The CAIO needs to know whether "indemnified" covers the vendor's training data choices, the agency's fine-tuning data, or both.
Almost certainly it covers the vendor's training data only. If the agency is fine-tuning on government-produced documents with clear provenance, the acquisition risk is low regardless. If the pipeline ingests third-party data, commercial datasets, scraped content, or anything else acquired outside a documented licensing chain, the agency is carrying source-acquisition exposure that no vendor indemnification clause currently on the market will cover.
Before the next procurement conversation, ask where the fine-tuning data came from and whether anyone has documentation of the rights chain. That question separates covered risk from uncovered risk, and it has nothing to do with whether training is transformative.
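The procurement question above can be turned into a simple inventory check. The sketch below is a hypothetical triage aid in Python, not a legal test: the `DataSource` fields, the origin labels, and the sample entries are all assumptions made for illustration, and a flagged source means only that someone should locate the rights documentation before the data enters a fine-tuning pipeline.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    origin: str                    # illustrative labels: "internal", "third_party", "scraped"
    rights_chain_documented: bool  # does anyone hold documentation of the rights chain?


def undocumented_sources(sources):
    """Return the sources that lack a documented rights chain."""
    return [s for s in sources if not s.rights_chain_documented]


# Hypothetical fine-tuning inventory.
inventory = [
    DataSource("agency_reports", "internal", True),
    DataSource("vendor_news_corpus", "third_party", False),
    DataSource("forum_scrape", "scraped", False),
]

for source in undocumented_sources(inventory):
    print(f"Review before use: {source.name} ({source.origin})")
```

The check deliberately ignores whether training is transformative. It asks only whether the rights chain is documented, which is the question that separates covered risk from uncovered risk.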
Case outcomes referenced in this lesson are live and subject to change. Verify current status of cited litigation before using in client-facing materials. Last reviewed: May 2026.

