AI Training Data 2026: When Provenance Becomes a Business Risk

A Microsoft contradiction, covert Claude use and a record settlement: why a model's data lineage decides deployment in 2026.

Within one week in June 2026, Microsoft came under pressure over contradictory claims about its training data, a report surfaced on xAI's covert use of Anthropic's Claude, and the start of EU enforcement for general-purpose AI moved closer. For you as a decision-maker, a once-academic question moves into procurement: what was this model actually trained on? This article frames the cases, explains the EU AI Act and the European copyright angle, and shows how to check data provenance before you deploy a model.

Summary

AI training-data provenance has become a legal and procurement question in 2026. At Build 2026, Microsoft marketed its MAI models as built on clean, commercially licensed data without distillation, yet its own technical paper lists Common Crawl with 24.2 billion web-scraped pages. Almost simultaneously, reports described how Elon Musk's xAI distilled Claude outputs into its own coding models for months and kept going through private accounts after Anthropic cut off access. The legal picture is tightening: the EU AI Act has required providers of general-purpose AI to publish a summary of their training content and maintain a copyright policy since August 2025, with fines possible from 2 August 2026. In the US, Anthropic pays 1.5 billion US dollars in the Bartz case for around 500,000 book titles, the country's largest copyright settlement. For companies this means provenance belongs in vendor due diligence, in contracts with indemnification clauses, and in their own data governance.

Why AI data provenance becomes a business risk

In 2026, data provenance decides whether a company can deploy an AI model with legal confidence. The question "What was this model trained on?" has moved from academia into procurement, because legal teams now check the data lineage of popular models before deploying them in finance, healthcare or government.

1.5bn US-dollar settlement (Bartz v. Anthropic)
24.2bn web-scraped pages (in the MAI corpus)
51 AI copyright suits (US courts, early 2026)
2 Aug 2026: GPAI enforcement (fines possible)
~500,000 book titles covered (in the Anthropic settlement)
~24 Code of Practice signatories (GPAI providers)

Data provenance is the traceable origin of a model's training data, that is, which sources were used, how they were acquired, and whether rights and opt-outs were respected.

Provenance touches three layers at once: copyright, contract law and reputational risk. A company that buys a model indirectly inherits the provider's training-data risks. innobu covers how to connect legal duties with day-to-day practice in its guide to ethical and legal AI compliance.

Microsoft and xAI: two lessons in data provenance

Both cases share one core point: marketing claims about training data do not necessarily match technical reality, and companies have to check this themselves. Microsoft AI chief Mustafa Suleyman described the MAI-Thinking-1 model as trained from the ground up on clean, commercially licensed data without distillation from third-party models. The published technical paper contradicts that account.

Microsoft MAI: claim versus paper
Marketing claim: clean, commercially licensed data without distillation
Paper: pipeline starts with around 1.2 trillion crawled pages, filtered to 794 billion
Common Crawl as a component, 24.2 billion web-scraped pages
Common Crawl makes no licensing representations and pays no rightsholders

xAI: covert Claude use
Reportedly distilled Claude outputs into its own coding models for months
After Anthropic cut off access in January 2026, continued via private accounts and Blackbox AI
Anthropic's terms (Section D.4) prohibit training competing models
Anthropic now acts specifically against unauthorized Claude use

Trained from the ground up on clean, commercially licensed data. The company's own technical paper still lists Common Crawl.

Contrast of the Microsoft claim and the MAI paper, June 2026

The lesson for decision-makers is sobering: a label like "clean" or "licensed" is not a check, it is a claim. It only becomes reliable once the data source is documented, and that is exactly what the EU AI Act now requires.

The EU AI Act makes provenance a duty

The EU AI Act has, since 2 August 2025, required providers of general-purpose AI to give reliable information about their training data for the first time. At its core is a sufficiently detailed summary of training content using a binding AI Office template (Article 53(1)(d)) plus a copyright policy (Article 53(1)(c)). From 2 August 2026 the Commission can enforce breaches with fines.

Figure: Data-provenance governance in three layers, from the data source through provider duties under Article 53 to the due diligence of deploying companies.

The AI Office template requires three blocks: general model information, a list of data sources including the top 10 percent of domain names for web-scraped content, and processing details including how opt-outs under the Copyright Directive were respected. Around 24 providers have signed the accompanying GPAI Code of Practice, among them Anthropic, Google, Microsoft, Mistral and OpenAI. Meta has not signed.
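For procurement teams, the three template blocks can be tracked as a simple record. The following Python sketch is illustrative only: the class name `TrainingSummary`, its field names and the `missing_blocks` helper are assumptions made for this example and are not terms taken from the AI Office template.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingSummary:
    """Hypothetical record of a provider's training-content summary."""
    # Block 1: general model information
    model_info: str = ""
    # Block 2: list of data sources, plus top domains for web-scraped content
    data_sources: list[str] = field(default_factory=list)
    top_domains: list[str] = field(default_factory=list)
    # Block 3: processing details, including how opt-outs were respected
    opt_out_handling: str = ""


def missing_blocks(summary: TrainingSummary) -> list[str]:
    """Return the names of the template blocks the provider left empty."""
    missing = []
    if not summary.model_info.strip():
        missing.append("general model information")
    if not summary.data_sources:
        missing.append("data sources")
    if not summary.opt_out_handling.strip():
        missing.append("processing details and opt-outs")
    return missing


# Example: a summary that names sources but says nothing about opt-outs.
received = TrainingSummary(
    model_info="Example model, text-only",
    data_sources=["licensed news archive", "web crawl"],
)
print(missing_blocks(received))
```

A record like this does not judge the quality of a summary. It only makes visible which of the three blocks a provider has not filled in, which is a reasonable first question to send back to the vendor.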

Key point

The summary is not a marketing text but a verifiable document. For models placed on the market before 2 August 2025 the duty applies only from 2 August 2027, while it already applies to new models. Anyone procuring models can request these documents and hold them against the marketing.

innobu shows how the GPAI duties fit into the larger EU AI Act timeline in its article on the EU AI Act high-risk deadlines through 2027 and 2028.

German perspective: Section 44b UrhG and the opt-out

In Germany, Section 44b UrhG governs text and data mining and implements the EU exception: reproducing lawfully accessible works is permitted unless the rightsholder declares a reservation of use. For online works this reservation is only valid if it is machine-readable. This is exactly where the dispute sits in 2026.

Reservation of use is a rightsholder's declaration that their online works may not be used for text and data mining. Online it is only valid when machine-readable, for example via robots.txt or the TDM-Rep protocol.

The Hamburg Regional Court ruled, in the case of photographer Robert Kneschke against dataset creator LAION, that AI training is not categorically outside the scope of Section 44b. The court also considered it conceivable that a reservation expressed in natural language can suffice as machine-readable, because modern AI can interpret language.

What this means for your own content: Anyone who does not want to end up in training data must set the reservation actively and machine-readably. In practice this works today via robots.txt and the W3C TDM-Rep protocol. A mere notice in an imprint is, by common reading, not enough.
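How a machine-readable reservation looks from a crawler's side can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the robots.txt content is a made-up example, `CCBot` is used as the user agent of Common Crawl's crawler, and the `tdm-reservation` header with the value `1` reflects one reading of the TDM-Rep protocol that should be checked against the current W3C document. The helper names `may_crawl` and `has_tdm_reservation` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: blocks one crawler (CCBot) and allows everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def has_tdm_reservation(headers: dict[str, str]) -> bool:
    """Check HTTP response headers for a TDM reservation.

    Assumption: the reservation is signalled as `tdm-reservation: 1`.
    """
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("tdm-reservation") == "1"


url = "https://example.com/articles/1"
print(may_crawl(ROBOTS_TXT, "CCBot", url))      # blocked by the first rule group
print(may_crawl(ROBOTS_TXT, "OtherBot", url))   # falls through to the wildcard group
print(has_tdm_reservation({"TDM-Reservation": "1"}))
```

The same check works in both directions: rightsholders can test whether their own reservation is actually readable by a parser, and teams that collect data for fine-tuning can run it before fetching third-party pages.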

For companies that process personal data, data protection adds another layer, since training and input data can fall under the GDPR. innobu explores how AI security and data protection interlock in a European context in its article on AI security and data protection for the enterprise.

What the 1.5-billion settlement signals

The settlement in Bartz v. Anthropic shows that the biggest risk is not use as such, but the way the data was acquired. Anthropic pays 1.5 billion US dollars into a settlement fund, around 500,000 book titles are covered, and rightsholders can expect at least about 3,000 US dollars per title. It is the largest copyright settlement in the US.
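The per-title figure follows from simple division of the two numbers above. The short Python calculation below is only a plausibility check: it divides the fund by the approximate number of titles and ignores any deductions or distribution rules, which this article does not cover. The variable names are chosen for illustration.

```python
SETTLEMENT_USD = 1_500_000_000  # settlement fund as reported above
TITLES_COVERED = 500_000        # approximate number of covered book titles

per_title = SETTLEMENT_USD / TITLES_COVERED
print(f"{per_title:,.0f} US dollars per title")  # prints "3,000 US dollars per title"
```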

Figure: Books as training material. The US case turned on works sourced from shadow libraries, not on model use itself.

1.5bn US-dollar settlement fund
~500,000 book titles covered
~3,000 US dollars per title

Judge William Alsup ruled in 2025 that training on legally acquired books can be fair use, but downloading from shadow libraries such as LibGen is not. In early 2026 around 51 AI copyright suits were pending in US courts. The message for providers and customers: the origin and acquisition path of training data are litigable, not just model use.

Key point

Fair use is a US concept and does not apply in the EU. For European companies what counts is the acquisition path plus the opt-out situation under Section 44b UrhG, not a provider's blanket fair-use assumption.

Challenges and risks

The topic is not one-sided. Fully licensing all training data is expensive and barely workable in practice, and part of the industry sees web scraping as a necessary basis for innovation. At the same time, companies that deploy such models carry a real residual risk.

Arguments from the provider side
Common Crawl has been an industry standard for web data for years
A US court treated training on lawfully accessible works partly as fair use
Full licensing would disadvantage smaller providers

Critical counter-voices
Some consider the transparency template too vague
Rightsholders see their works used without compensation
The burden of proof effectively shifts to the creators

The real company risk: Model providers often disclose their data lineage only in part, indemnities apply only under certain conditions, and in regulated sectors an unclear data lineage can block procurement. Skipping the check does not remove the risk, it only hides it.

What companies should do now

Companies should build data provenance firmly into AI procurement instead of relying on marketing claims. Three levers are immediately actionable: vendor due diligence, contractual protection and your own data governance. A fourth step scales the depth of the check to the sector and workload.

Figure: Provenance belongs in procurement. Vendor due diligence and the contract decide the residual risk of deploying AI.

  1. Check the provider

    Request the GPAI training-data summary and the copyright policy, and compare them against the marketing claims. Look at the named data sources and the handling of opt-outs, not at slogans such as clean or licensed.

  2. Protect via contract

    Use indemnification clauses. Microsoft (Copilot Copyright Commitment), Google Cloud, OpenAI (Copyright Shield) and Anthropic offer them, each with conditions such as active safety filters and rights to the input data. Clarify what the indemnity really covers.

  3. Put your own data in order

    When fine-tuning your own models, document the provenance of your training data and respect third-party opt-outs via robots.txt and TDM-Rep. This stops you from carrying the problem into your own models.

  4. Match sector and workload

    Decide where an unclear data lineage is acceptable and where it is not. In finance, healthcare and government, data provenance should be part of the sign-off before a model goes into production.
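The four steps above can be sketched as a small procurement gate. This is a hypothetical sketch, not a recommended policy: the function name `provenance_signoff`, its parameters and the rule that open items block sign-off only in regulated sectors are assumptions made for this example. The set of sectors is taken from the text.

```python
REGULATED_SECTORS = {"finance", "healthcare", "government"}  # sectors named in the text


def provenance_signoff(
    sector: str,
    has_summary: bool,
    has_copyright_policy: bool,
    indemnity_reviewed: bool,
) -> tuple[bool, list[str]]:
    """Return (approved, open_items) for a hypothetical procurement gate."""
    open_items = []
    if not has_summary:
        open_items.append("request GPAI training-data summary")
    if not has_copyright_policy:
        open_items.append("request copyright policy")
    if not indemnity_reviewed:
        open_items.append("clarify indemnity scope and conditions")

    # Assumption: in regulated sectors every item must be closed before
    # production; elsewhere open items are recorded but do not block.
    if sector.lower() in REGULATED_SECTORS:
        approved = not open_items
    else:
        approved = True
    return approved, open_items


print(provenance_signoff("finance", True, True, False))
print(provenance_signoff("retail", False, True, True))
```

Where exactly the line between blocking and non-blocking items sits is a business decision. The point of the sketch is only that the decision is written down and applied per sector rather than made case by case.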

Key point

In 2026, data provenance is no longer a niche topic for lawyers but part of any serious AI procurement. Checking providers, securing contractual protection and putting your own data in order lowers your risk markedly without slowing AI adoption.

Frequently asked questions

Why is AI training-data provenance a risk in 2026?

Because legal teams check how a model was trained before deploying it. Microsoft marketed its MAI models as built on clean, licensed data, yet its own paper lists Common Crawl with 24.2 billion web-scraped pages. A company that buys a model indirectly inherits the provider's training-data risks across copyright, contract and reputation.

What does the EU AI Act require on training-data transparency?

Providers of general-purpose AI must, since 2 August 2025, publish a sufficiently detailed summary of training content using an AI Office template (Article 53(1)(d)) and maintain a copyright policy (Article 53(1)(c)). From 2 August 2026 the Commission can enforce breaches with fines.

What does Germany's Section 44b UrhG mean for AI training?

Section 44b UrhG, which implements the EU text-and-data-mining exception, permits mining of lawfully accessible works unless the rightsholder declares a reservation of use. For online works this reservation is only valid if machine-readable, for example via robots.txt or the TDM-Rep protocol. The Hamburg Regional Court held that AI training is not categorically outside its scope.

What did the 1.5-billion-dollar Anthropic settlement involve?

In Bartz v. Anthropic, Anthropic pays 1.5 billion US dollars into a settlement fund, around 500,000 book titles are covered, and rightsholders can expect at least about 3,000 US dollars per title. Judge Alsup ruled in 2025 that training on legally acquired books can be fair use, but downloading from shadow libraries is not.

What should companies do now?

Three steps are immediately actionable: first, request the provider's GPAI training-data summary and copyright policy and compare them against the marketing; second, use indemnification clauses in contracts, such as Microsoft's Copilot Copyright Commitment, OpenAI's Copyright Shield or the offers from Google and Anthropic, each with conditions; third, when fine-tuning your own models, document data provenance and respect third-party opt-outs.