Copyright and the Training-Data Split

Courts are splitting AI copyright into two fights: training versus how the data was acquired. Here's what that means for your deal.

By Leigh Garrity, May 8, 2026

The training-data copyright dispute is about whether AI companies can legally use copyrighted material to build models. The answer depends on which part of "use" you're asking about, because courts have started splitting that question in two.

Question one: Is training an AI model on copyrighted material fair use?

Question two: How did the AI company get that material in the first place?

The answers are diverging. Training is trending toward fair use. Acquiring material through piracy remains straightforwardly infringing. If you only remember one thing from this piece, remember the split itself: it defines the risk landscape your buyers are navigating.

Flagged for Accuracy Review

The Bartz v. Anthropic final fairness hearing is scheduled for May 14, 2026. Major fair use rulings in the Sony v. Suno and Sony v. Udio music cases are expected summer 2026. Outcomes may materially change this analysis.

The mechanism behind the split

Fair use is a defense in copyright law. It permits use of copyrighted material without permission when the use is sufficiently different in purpose and character from the original. The Supreme Court's 2023 ruling in Andy Warhol Foundation v. Goldsmith actually narrowed the transformative use doctrine, pushing back on the idea that any aesthetic transformation qualifies and emphasizing the purpose and character of the specific use. That tighter standard is the backdrop for everything that follows. When a federal judge calls AI training "exceedingly transformative," he's clearing a bar the Supreme Court just raised.

Training a large language model extracts statistical patterns from text. It doesn't store copies in any retrievable form: the model can't pull page 47 of a specific book on demand, because page 47 doesn't exist inside the model as a stored document. What exists is a set of learned relationships between tokens, derived from millions of documents, and the output serves a different purpose than any individual input. Outputs can still sometimes reproduce passages from training data, which is exactly what the NYT v. OpenAI case alleges.

In June 2025, Judge Alsup in the Northern District of California ruled on exactly this question in Bartz v. Anthropic. His conclusion: training is fair use, and emphatically so. He characterized the transformation in extraordinary terms, calling the use of copyrighted books to train Claude "exceedingly transformative" and "transformative — spectacularly so", likening what Anthropic was doing to the human act of reading existing texts and writing new ones. He also rejected the argument that an emerging market for AI training licenses was one the Copyright Act entitles authors to exploit. Two days later, Judge Chhabria reached the same result in Kadrey v. Meta, finding Meta's copying of books for LLM training was fair use (White & Case corroborates). Two federal judges, same district, same week, same conclusion: training transforms copyrighted input into something categorically different.

Then Alsup did something that changed the entire litigation landscape. He split the case.

Anthropic had downloaded over seven million books from LibGen and Z-Library, both pirate sites. Alsup ruled that this acquisition was a separate act from training and was not fair use (Norton Rose Fulbright reaches the same reading). His reasoning was surgical: downloading pirated copies to build a research library is its own use, and it's not transformative. You can't launder pirated material through a transformative process. The fact that the books would eventually be used for something transformative didn't retroactively clean the download. In the court's words:

"No damages from pirating copies could be undone by later paying for copies of the same works."

Two acts. Act one: downloading seven million books from pirate sites. Act two: training a model on those books. Act two is fair use. Act one is infringement. No amount of act-two fairness washes act-one piracy.

Note the doctrinal divergence between the two judges, though. Alsup analyzed each step separately: acquisition and training as distinct legal acts. Chhabria viewed downloading and training as one integrated process, finding the initial downloading must be considered "in light of its ultimate, highly transformative purpose." If Chhabria's approach prevails in future cases, the acquisition question gets folded into the training question. If Alsup's approach prevails, acquisition stands alone, and provenance becomes the litigation target. This disagreement hasn't been resolved.

Anthropic settled the piracy claims for $1.5 billion. Of 482,460 eligible works, 440,490 have been claimed — a 91.3% claim rate against a typical class action rate of around 10%. The estimated payout is roughly $2,931 per work after deductions. The final fairness hearing is May 14, 2026. Six days from now. The settlement has not received final court approval, Anthropic has denied wrongdoing, and the settlement creates no binding precedent on fair use for future courts. The class was certified only for piracy, not for the act of training, so the fair use holding binds only the three named plaintiffs.
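
The settlement arithmetic above can be sanity-checked in a few lines. This is a minimal sketch using only the figures quoted in this piece. The variable names are illustrative, and the two "gross" numbers are simple upper bounds before any deductions, not figures from the settlement documents. The piece does not itemize the deductions or say which denominator the roughly $2,931 estimate uses, so the sketch does not try to reproduce it.

```python
# Figures quoted in the paragraph above.
SETTLEMENT_USD = 1_500_000_000
ELIGIBLE_WORKS = 482_460
CLAIMED_WORKS = 440_490
QUOTED_NET_PER_WORK_USD = 2_931  # "roughly", after deductions (as quoted)

# Share of eligible works for which a claim was filed.
claim_rate = CLAIMED_WORKS / ELIGIBLE_WORKS

# Gross per-work figures before deductions. Which denominator the final
# payout uses is an assumption either way, so both are shown.
gross_per_eligible = SETTLEMENT_USD / ELIGIBLE_WORKS
gross_per_claimed = SETTLEMENT_USD / CLAIMED_WORKS

print(f"claim rate: {claim_rate:.1%}")
print(f"gross per eligible work: ${gross_per_eligible:,.0f}")
print(f"gross per claimed work:  ${gross_per_claimed:,.0f}")
print(f"quoted net per work:     ${QUOTED_NET_PER_WORK_USD:,}")
```

The quoted net figure sits below both gross bounds, which is consistent with fees and costs coming out of the fund before distribution.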

The precedent it creates is entirely practical. What copyright scholar Ed Lee has called the "Shadow Library Strategy" — a term he coined to describe the emerging litigation pattern — skips the fair use argument on training entirely and goes straight for the piracy angle. A $3 billion lawsuit filed by UMG, Concord, and ABKCO against Anthropic in January 2026 (Copyright Alliance tracker corroborates) runs exactly this play, alleging Anthropic torrented song lyrics from shadow libraries. The legal theory leans directly on Alsup's ruling.

Two important caveats on the doctrinal picture. Thomson Reuters v. Ross Intelligence in Delaware found against fair use where a competitor trained an AI search tool on proprietary Westlaw headnotes — the court called the questions "hard" and granted interlocutory appeal to the Third Circuit, where it remains pending. And NYT v. OpenAI raises a distinct question about outputs that reproduce substantial portions of copyrighted text, with discovery ongoing and 20 million ChatGPT logs ordered produced. Output reproduction is a different legal problem from training, and this piece won't try to resolve it.

The Sony v. Suno and Sony v. Udio music cases will deliver the next significant fair use rulings, expected this summer. Warner Music and Universal Music Group have already settled with Suno and Udio respectively, but Sony has settled with neither. If courts extend the training-as-transformative-use reasoning to music, the doctrinal trend solidifies broadly. If they don't, the split widens and acquisition becomes even more central to litigation strategy.

• The split: Courts are separating training (trending toward fair use) from source acquisition (pirated sources not protected). Financial exposure concentrates on the acquisition side. Major rulings pending summer 2026.

Okta Concept Mapping: Data Provenance and Chain of Custody

If you've explained to a buyer why it matters where a credential was issued and how it was handled across its lifecycle, you already have the right instinct for why courts care about training-data sourcing. Alsup's ruling essentially says data acquired through unauthorized channels taints downstream use regardless of what happens next. Chain of custody is chain of custody.

Here's where your IDAM intuition breaks. In identity, establishing proper provenance and access controls makes the use authorized. Period. In copyright, "transformative use" is a legal judgment about the nature of what the use produces, independent of whether access was authorized. Anthropic had terrible provenance (pirate sites) but the training itself was still fair use. You can have perfect chain of custody and still infringe if your output reproduces protected expression. You can have zero provenance documentation and still qualify for fair use if the transformation is sufficient. The legal question runs orthogonal to the access-control question. Provenance solves the acquisition problem. The use problem is a separate legal universe.

What vendor indemnification actually covers

Every major AI vendor offers copyright indemnification for enterprise customers. The coverage is narrower than most buyers assume, and the gaps are specific enough to be worth naming.

The structural pattern is consistent. Microsoft's Copilot Copyright Commitment, OpenAI's Copyright Shield, and Google's generative AI indemnification all cover the same core scenario: a third party sues your organization claiming that an AI-generated output infringes their copyright. The vendor defends you and pays resulting judgments or settlements.

Google's terms are structurally distinct. They include a separate provision covering training-data claims, meaning Google will defend customers against allegations that Google's own training process infringed third-party IP. Microsoft's and OpenAI's published terms cover output-side claims only.

The exclusions are where buyers get surprised. All vendors exclude:

Customer fine-tuning with external data. Upload content you don't own to customize a model, and the output falls outside indemnification. Microsoft's terms are explicit: the customer "must have sufficient rights to use the input" including any data used for customization. OpenAI's Copyright Shield similarly excludes customer content and modifications from coverage.

Intentional or knowing infringement. If you knew or should have known the output was infringing, coverage evaporates.

Free-tier or beta usage. OpenAI excludes beta services entirely. All vendors exclude free tiers.

Trademark claims. Universally excluded.

Third-party models on vendor platforms. Google explicitly disclaims liability for non-Google models deployed through Vertex AI.
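
For account prep, the exclusions above can be turned into a rough triage checklist. The sketch below is a hypothetical illustration, not any vendor's actual terms or API: the `Deployment` fields and the `exclusion_flags` function are invented names, and a returned flag means "ask about this," not "coverage is denied." Actual coverage depends on the specific contract language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical description of how a buyer uses a vendor's model."""
    paid_commercial_tier: bool
    beta_service: bool
    fine_tuned_on_unlicensed_data: bool
    knew_output_infringed: bool
    claim_type: str           # "copyright" or "trademark"
    third_party_model: bool   # non-vendor model hosted on the vendor platform


def exclusion_flags(d: Deployment) -> list[str]:
    """Return the exclusion categories from the list above that may apply."""
    flags = []
    if d.fine_tuned_on_unlicensed_data:
        flags.append("customer fine-tuning with external data")
    if d.knew_output_infringed:
        flags.append("intentional or knowing infringement")
    if not d.paid_commercial_tier:
        flags.append("free-tier usage")
    if d.beta_service:
        flags.append("beta usage")
    if d.claim_type == "trademark":
        flags.append("trademark claim")
    if d.third_party_model:
        flags.append("third-party model on vendor platform")
    return flags


# Example: a paid deployment that was fine-tuned on content the buyer
# does not own.
example = Deployment(
    paid_commercial_tier=True,
    beta_service=False,
    fine_tuned_on_unlicensed_data=True,
    knew_output_infringed=False,
    claim_type="copyright",
    third_party_model=False,
)
for flag in exclusion_flags(example):
    print("possible exclusion:", flag)
```

An empty result only means none of the listed exclusions was triggered. It says nothing about upstream acquisition risk.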

One gap worth flagging for the Bartz scenario specifically: vendor indemnification protects you against claims that your use of the output infringes. It provides no insulation from the vendor's own upstream acquisition liability. If a vendor's training data was pirated, the resulting claims run against the vendor, not the customer. But the reputational and procurement risk of deploying a model built on contested data is a calculation no indemnification clause touches.

Anthropic's enterprise indemnification terms are not publicly documented in sufficient detail to compare directly. Analyst coverage (Redress Compliance, February 2026) indicates Anthropic offers enterprise IP indemnity, but verify against Anthropic's published terms before relying on any characterization in a procurement conversation.

• Indemnification: Vendor copyright indemnification covers output-side infringement claims against the customer. It leaves out upstream acquisition risk, customer fine-tuning with unlicensed data, free-tier usage, and the reputational exposure of deploying models with contested training provenance.

The procurement conversation

So where does this land in your account?

Public sector buyers evaluating AI tools operate under OMB M-25-22 (Wiley Law analysis), which requires contract terms prohibiting vendors from using non-public government data to train commercially available models. That addresses the reverse data flow: government data leaking into vendor training sets. The requirement says nothing about vendors certifying the provenance of their own training data or providing copyright indemnification as a procurement condition.

OMB M-26-04 (Crowell & Moring analysis) requires documentation of how models are "built, trained, or modified," which creates an oblique provenance disclosure requirement. But it was written with bias in mind, and copyright wasn't the concern.

No standardized federal contract clause requires training-data provenance documentation today. That gap is exactly where your buyer's questions will cluster.

The questions you'll hear:

"Does your vendor indemnify us for AI-generated outputs?" Yes, for paid commercial tiers, with the exclusions above. Real and meaningful protection against the most likely claim a government buyer would face.

"What if the model was trained on pirated data?" The major vendors have not been found liable for training-data piracy. The Bartz settlement resolved Anthropic's specific exposure without an admission of wrongdoing. But the legal theory is live, new cases are being filed on the same basis, and no vendor's standard indemnification extends to covering a buyer's reputational exposure from this risk.

"Should we require training-data provenance documentation?" No federal requirement yet. But the trajectory of the litigation suggests this will become a standard procurement ask. The shadow library strategy works precisely because provenance failures are easier to prove than fair use is to defeat. A buyer who starts asking now is ahead of the curve.

The summer 2026 rulings will materially affect this landscape. The Bartz final approval hearing on May 14 will determine whether the $1.5 billion settlement stands and, with it, the financial benchmark for piracy-based AI copyright claims. Watch both.

• Procurement: Federal guidance addresses government data flowing into vendor training, but says nothing about the copyright status of vendor training data. Vendor indemnification covers output claims. The acquisition-side risk is real, unresolved, and not yet addressed by standardized procurement language. Flag for update as summer 2026 decisions arrive.

Things to follow up on...

Sony's summer fair use rulings: The Sony v. Suno and Sony v. Udio cases are expected to produce the next major fair use decisions this summer, and a ruling against transformative use in music would fracture the doctrinal trend that's been building since Bartz.
Thomson Reuters on appeal: The Third Circuit's pending decision in Thomson Reuters v. Ross Intelligence could redefine fair use boundaries when AI is trained on proprietary, curated content to build a direct competitor.
Google's training-data indemnity: Google is the only major vendor whose published terms include a separate indemnification provision for training-data claims, covering allegations that Google's own use of training data infringed third-party IP — a structural distinction worth tracking as procurement language evolves.
OMB's provenance gap: No federal contract clause currently requires AI vendors to certify training-data provenance, but OMB M-26-04's model documentation requirements create an indirect disclosure hook that procurement officers may start leveraging before formal guidance catches up.