AI Training Data 2026: When Provenance Becomes a Business Risk
Within one week in June 2026, Microsoft came under pressure over contradictory claims about its training data, a report surfaced on xAI's covert use of Anthropic's Claude, and the start of EU enforcement for general-purpose AI moved closer. For you as a decision-maker, a once-academic question moves into procurement: what was this model actually trained on? This article frames the cases, explains the EU AI Act and the European copyright angle, and shows how to check data provenance before you deploy a model.
AI training-data provenance has become a legal and procurement question in 2026. At Build 2026, Microsoft marketed its MAI models as built on clean, commercially licensed data without distillation, yet its own technical paper lists Common Crawl with 24.2 billion web-scraped pages. Almost simultaneously, reports described how Elon Musk's xAI distilled Claude outputs into its own coding models for months and kept going through private accounts after Anthropic cut off access. The legal picture is tightening: the EU AI Act has required providers of general-purpose AI to publish a summary of their training content and maintain a copyright policy since August 2025, with fines possible from 2 August 2026. In the US, Anthropic pays 1.5 billion US dollars in the Bartz case for around 500,000 book titles, the country's largest copyright settlement. For companies this means provenance belongs in vendor due diligence, in contracts with indemnification clauses, and in their own data governance.
Why AI data provenance becomes a business risk
In 2026, data provenance decides whether a company can deploy an AI model with legal confidence. The question "what was this model trained on?" has moved from academia into procurement, because legal teams now check the data lineage of popular models before deploying them in finance, healthcare or government.
Provenance touches three layers at once: copyright, contract law and reputational risk. A company that buys a model indirectly inherits the provider's training-data risks. innobu covers how to connect legal duties with day-to-day practice in its guide to ethical and legal AI compliance.
Microsoft and xAI: two lessons in data provenance
Both cases share one core point: marketing claims about training data do not necessarily match technical reality, and companies have to check this themselves. Microsoft AI chief Mustafa Suleyman described the MAI-Thinking-1 model as trained from the ground up on clean, commercially licensed data without distillation from third-party models. The published technical paper contradicts that account.
Trained from the ground up on clean, commercially licensed data. The company's own technical paper still lists Common Crawl.
Contrast of the Microsoft claim and the MAI paper, June 2026
The lesson for decision-makers is sober: a label like "clean" or "licensed" is not a check, it is a claim. It only becomes reliable through the documented data source, and that is exactly what the EU AI Act now requires.
The EU AI Act makes provenance a duty
Since 2 August 2025, the EU AI Act has for the first time required providers of general-purpose AI to give reliable information about their training data. At its core is a sufficiently detailed summary of training content based on a binding AI Office template (Article 53(1)(d)), plus a copyright policy (Article 53(1)(c)). From 2 August 2026 the Commission can sanction breaches with fines.
The AI Office template requires three blocks: general model information, a list of data sources including the top 10 percent of domain names for web-scraped content, and processing details including how opt-outs under the Copyright Directive were respected. Around 24 providers have signed the accompanying GPAI Code of Practice, among them Anthropic, Google, Microsoft, Mistral and OpenAI. Meta has not signed.
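To make the "top 10 percent of domain names" item concrete, here is a minimal Python sketch of one possible reading: aggregate scraped volume per domain and keep the largest tenth of domains. The function name `top_domains_by_size`, the ranking by bytes and the rounding-up rule are assumptions for illustration, not the AI Office's official method.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_domains_by_size(pages, percent=10):
    """Return the largest `percent` of domains, ranked by scraped bytes.

    `pages` is an iterable of (url, size_in_bytes) pairs. Rounding up so
    that at least one domain is listed is an assumption of this sketch.
    """
    bytes_per_domain = Counter()
    for url, size in pages:
        host = urlsplit(url).hostname
        if host:  # skip malformed URLs without a host
            bytes_per_domain[host.lower()] += size
    if not bytes_per_domain:
        return []
    # Integer ceiling division avoids floating-point rounding surprises.
    keep = max(1, (len(bytes_per_domain) * percent + 99) // 100)
    return bytes_per_domain.most_common(keep)


# Hypothetical sample data, not taken from any real training set.
sample = [
    ("https://example.org/a", 5_000),
    ("https://example.org/b", 7_000),
    ("https://news.example.com/x", 3_000),
    ("https://blog.example.net/y", 1_000),
]
print(top_domains_by_size(sample))
```

When you review a provider's summary, the question to ask is how the ranking was computed and whether the listed domains plausibly match the marketing claims.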
The summary is not a marketing text but a verifiable document. For models placed on the market before 2 August 2025 the duty applies only from 2 August 2027, while it already applies to new models. Anyone procuring models can request these documents and compare them against the marketing claims.
innobu shows how the GPAI duties fit into the larger EU AI Act timeline in its article on the EU AI Act high-risk deadlines through 2027 and 2028.
German perspective: Section 44b UrhG and the opt-out
In Germany, Section 44b UrhG governs text and data mining and implements the EU exception: reproducing lawfully accessible works is permitted unless the rightsholder declares a reservation of use. For online works this reservation is only valid if it is machine-readable. This is exactly where the dispute sits in 2026.
The Hamburg Regional Court ruled, in the case of photographer Robert Kneschke against dataset creator LAION, that AI training is not categorically outside the scope of Section 44b. The court also considered it conceivable that a reservation expressed in natural language can suffice as machine-readable, because modern AI can interpret language.
What this means for your own content: Anyone who does not want to end up in training data must set the reservation actively and machine-readably. In practice this works today via robots.txt and the W3C TDM-Rep protocol. A mere notice in an imprint is, by common reading, not enough.
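As a rough illustration of a machine-readable reservation, the following Python sketch checks a robots.txt file for crawler-specific blocks and looks for a TDM-Rep signal in HTTP response headers. The crawler names `CCBot` and `GPTBot` are examples not taken from this article, the helper names are hypothetical, and the `tdm-reservation: 1` header reflects my reading of the TDM-Rep specification, so verify it against the current text before relying on it.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt that blocks two AI crawlers and leaves the site open to others.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def crawler_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def tdm_reserved(headers):
    """Return True if the response headers carry `tdm-reservation: 1`.

    Assumption: header name and value follow the TDM-Rep specification.
    """
    normalized = {key.lower(): value.strip() for key, value in headers.items()}
    return normalized.get("tdm-reservation") == "1"


print(crawler_allowed(ROBOTS_TXT, "CCBot", "https://example.org/article"))
print(crawler_allowed(ROBOTS_TXT, "Mozilla", "https://example.org/article"))
print(tdm_reserved({"TDM-Reservation": "1"}))
```

The sketch parses an inline robots.txt so that it runs without network access. In practice you would load your own site's file and response headers and check them the same way.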
For companies that process personal data, data protection adds another layer, since training and input data can fall under the GDPR. innobu explores how AI security and data protection interlock in a European context in its article on AI security and data protection for the enterprise.
What the 1.5-billion settlement signals
The settlement in Bartz v. Anthropic shows that the biggest risk is not use as such, but the way the data was acquired. Anthropic pays 1.5 billion US dollars into a settlement fund, around 500,000 book titles are covered, and rightsholders can expect at least about 3,000 US dollars per title. It is the largest copyright settlement in the US.
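The per-title figure follows from simple division of the two numbers above. This is a back-of-the-envelope check only, assuming an even split and ignoring fees, costs and the exact final title count:

```python
settlement_usd = 1_500_000_000  # settlement fund stated above
covered_titles = 500_000        # approximate number of covered book titles

# Even split across titles; a simplifying assumption of this sketch.
per_title_usd = settlement_usd // covered_titles
print(per_title_usd)  # 3000
```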
Judge William Alsup ruled in 2025 that training on legally acquired books can be fair use, but downloading from shadow libraries such as LibGen is not. In early 2026 around 51 AI copyright suits were pending in US courts. The message for providers and customers: the origin and acquisition path of training data are litigable, not just model use.
Fair use is a US concept and does not apply in the EU. For European companies what counts is the acquisition path plus the opt-out situation under Section 44b UrhG, not a provider's blanket fair-use assumption.
Challenges and risks
The topic is not one-sided. Fully licensing all training data is expensive and barely workable in practice, and part of the industry sees web scraping as a necessary basis for innovation. At the same time, companies that deploy such models carry a real residual risk.
The real company risk: Model providers often disclose their data lineage only partly, indemnities apply only under conditions, and in regulated sectors an unclear data lineage can block procurement. Skipping the check does not remove the risk, it only hides it.
What companies should do now
Companies should build data provenance firmly into AI procurement instead of relying on marketing claims. Three levers are immediately actionable: vendor due diligence, contractual protection and your own data governance.
- Check the provider: Request the GPAI training-data summary and the copyright policy, and compare them against the marketing claims. Look at the named data sources and the handling of opt-outs, not at slogans such as "clean" or "licensed".
- Protect via contract: Use indemnification clauses. Microsoft (Copilot Copyright Commitment), Google Cloud, OpenAI (Copyright Shield) and Anthropic offer them, each with conditions such as active safety filters and rights to the input data. Clarify what the indemnity really covers.
- Put your own data in order: When fine-tuning your own models, document the provenance of your training data and respect third-party opt-outs via robots.txt and TDM-Rep. This stops you from carrying the problem into your own models.
- Match sector and workload: Decide where an unclear data lineage is acceptable and where it is not. In finance, healthcare and government, data provenance should be part of the sign-off before a model goes into production.
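The steps above can be captured as a simple sign-off gate. The following Python sketch is a hypothetical illustration: the class `ProviderCheck`, its fields and the rule that regulated sectors require every item to pass are assumptions, and you would replace them with your own procurement policy.

```python
from dataclasses import dataclass, fields

# Sectors named in the text where provenance should be part of the sign-off.
REGULATED_SECTORS = {"finance", "healthcare", "government"}


@dataclass(frozen=True)
class ProviderCheck:
    summary_received: bool           # GPAI training-data summary on file
    copyright_policy_received: bool  # copyright policy on file
    sources_match_marketing: bool    # named sources consistent with marketing claims
    indemnity_in_contract: bool      # indemnification clause agreed
    indemnity_conditions_met: bool   # e.g. safety filters active, rights to input data


def open_items(check):
    """List the names of all checks that have not passed yet."""
    return [f.name for f in fields(check) if not getattr(check, f.name)]


def ready_for_production(check, sector):
    """Apply the assumed policy: strict in regulated sectors, documents-only elsewhere."""
    gaps = open_items(check)
    if sector.lower() in REGULATED_SECTORS:
        return not gaps
    required_documents = {"summary_received", "copyright_policy_received"}
    return not (required_documents & set(gaps))


# Hypothetical provider whose documents contradict its marketing.
check = ProviderCheck(True, True, False, True, True)
print(open_items(check))
print(ready_for_production(check, "finance"))
print(ready_for_production(check, "marketing"))
```

Keeping the open items as an explicit list makes the residual risk visible in the sign-off record instead of hiding it.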
In 2026, data provenance is no longer a niche topic for lawyers but part of any serious AI procurement. Checking providers, securing contractual protection and putting your own data in order lowers your risk markedly without slowing AI adoption.
Frequently asked questions
Why is AI training-data provenance a business risk?
Because legal teams check how a model was trained before deploying it. Microsoft marketed its MAI models as built on clean, licensed data, yet its own paper lists Common Crawl with 24.2 billion web-scraped pages. A company that buys a model indirectly inherits the provider's training-data risks across copyright, contract and reputation.
What does the EU AI Act require of model providers?
Providers of general-purpose AI must, since 2 August 2025, publish a sufficiently detailed summary of training content using an AI Office template (Article 53(1)(d)) and maintain a copyright policy (Article 53(1)(c)). From 2 August 2026 the Commission can sanction breaches with fines.
What does Section 44b UrhG mean for AI training in Germany?
Section 44b UrhG, which implements the EU text-and-data-mining exception, permits mining of lawfully accessible works unless the rightsholder declares a reservation of use. For online works this reservation is only valid if machine-readable, for example via robots.txt or the TDM-Rep protocol. The Hamburg Regional Court held that AI training is not categorically outside its scope.
What does the 1.5-billion settlement signal?
In Bartz v. Anthropic, Anthropic pays 1.5 billion US dollars into a settlement fund, around 500,000 book titles are covered, and rightsholders can expect at least about 3,000 US dollars per title. Judge Alsup ruled in 2025 that training on legally acquired books can be fair use, but downloading from shadow libraries is not.
What should companies do now?
Three steps are immediately actionable: first, request the provider's GPAI training-data summary and copyright policy and compare them against the marketing; second, use indemnification clauses in contracts, such as Microsoft's Copilot Copyright Commitment, OpenAI's Copyright Shield or the offers from Google and Anthropic, each with conditions; third, when fine-tuning your own models, document data provenance and respect third-party opt-outs.