AI Data Commons
Interactive paper guide

Adverse Selection in the AI Data Commons

Generative AI depends on high-quality web content, but the current opt-out regime gives producers a binary choice. The paper shows that the highest-quality producers are the first to leave.

Key figures:

- 9,611 MBFC-rated media and news sites
- 47% of high-factual outlets block AI crawlers
- 8% of low-factual sources block AI crawlers
- 24.3% simulated decline in high-factual content share
The commons after opt-out (chart, stylized from paper findings): the factual-rating composition of all media sites before blocking compared with the accessible corpus after blocking, broken out by category (low factual, mixed, mostly factual, high factual).
Motivation

AI training data is a market without a market.

The paper frames accessible web content as an AI data commons: a shared input that improves models, search, summaries, recommendations, and downstream consumer information.

No systematic compensation

Licensing deals exist for some publishers, but most content producers receive no direct payment when their work contributes to training corpora.

Binary governance

The practical tool is `robots.txt`: permit AI crawlers or block them. That simple choice replaces prices, contracts, and quality-sensitive bargaining.
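For concreteness, the entire governance instrument fits in a few lines. `GPTBot` and `Google-Extended` are the published opt-out tokens for OpenAI and for Google's AI training use; the directives below show the binary choice in full:

```
# Block AI training crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else, including ordinary search crawlers, remains allowed
User-agent: *
Allow: /
```

There is no field for price, scope, or terms: a producer either serves these lines or does not.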

Composition matters

The central question is not just how much content is blocked. It is which content exits and what remains available for AI systems.

Theory lab

Why high-quality producers exit first.

In the model, quality raises both the visibility benefit of staying accessible and the appropriation cost of uncompensated training. When the cost grows faster in quality than the benefit, a quality threshold q* emerges: producers above it block.

Interactive widget: traces the net incentive to remain accessible, with readouts for the blocking threshold q* and the quality loss among remaining producers; producers above q* block.
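The threshold logic can be sketched numerically. The functional forms below (linear visibility benefit, quadratic appropriation cost) and all parameter values are illustrative assumptions, not the paper's calibration:

```python
def net_incentive(q, beta=1.0, gamma=2.0):
    """Net payoff from staying accessible at quality q.

    Assumed forms: the visibility benefit beta*q grows linearly in
    quality, while the appropriation cost gamma*q**2 grows faster.
    Both forms and parameters are illustrative.
    """
    return beta * q - gamma * q ** 2

def blocking_threshold(beta=1.0, gamma=2.0):
    """Quality q* where beta*q = gamma*q**2, i.e. the net incentive hits zero."""
    return beta / gamma

q_star = blocking_threshold()  # 0.5 under the assumed parameters
decisions = {q: ("block" if q > q_star else "stay")
             for q in [0.2, 0.4, 0.6, 0.8]}
# Producers above q* exit first, so the accessible corpus loses its top end.
```

Because the cost term dominates at high q, exit is concentrated among exactly the producers whose content is most valuable.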
Empirical evidence

Six findings, one pattern: the commons tilts away from credible content.

The study combines live `robots.txt` scraping with Media Bias/Fact Check ratings and a 45-month HTTP Archive panel from June 2022 to February 2026.
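The measurement step can be mimicked with Python's standard library: parse a site's `robots.txt` and ask which AI crawler tokens it blocks. The crawler list below is a common choice in AI-blocking studies, not necessarily the paper's exact set:

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens often checked in AI-blocking measurements; the exact
# list used by the paper is an assumption here.
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai"]

def blocked_crawlers(robots_txt: str, site: str = "https://example.com/"):
    """Return the AI crawler tokens that this robots.txt blocks for the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, site)]

sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_crawlers(sample))  # ['GPTBot']
```

Run at scale over a site panel, the same check yields the blocking rates and quality gradient reported below.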

Quality gradient

Blocking rates rise with factual quality: 47% of high-factual outlets block AI crawlers, versus 8% of low-factual sources.
Temporal dynamics

The gradient appears when AI-specific opt-out becomes salient.

Event studies trace the divergence to the August 2023 GPTBot announcement. Before that moment, blocking was uniformly low and the quality gradient was small.

June 2022

Panel begins. AI-specific blocking is rare, with fewer than 10% of sites blocking any AI crawler.

August 2023 - GPTBot announcement

OpenAI makes AI-specific opt-out visible and actionable. High-factual sites move first and fastest.

December 2023 - NYT v. OpenAI

Copyright concerns become more salient across the media sector.

July 2024 - one-click blocking

Cloudflare lowers the operational cost of opting out of AI crawling.

February 2026

The cross-section covers 9,611 sites; the panel covers 7,002 sites and 272,458 site-month observations.

Post-period amplification (chart: mean coefficient ratio).
Counterfactual simulator

The problem is selective exit, not blocking alone.

The paper's calibrated simulation compares blocking regimes at matched or varying levels of total withdrawal. Holding volume fixed, random blocking removes only volume; selective blocking also removes quality.

Interactive simulator: the default regime is the currently observed selective blocking, with readouts for the decline relative to the no-blocking baseline: token-weighted quality decline, high-factual share decline, and volume withdrawn.
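The contrast between the two regimes can be reproduced in miniature. The quality distribution, token counts, and withdrawal rate below are illustrative assumptions, not the paper's calibration:

```python
import random

random.seed(0)

# Synthetic corpus: each site gets a quality score in [0, 1] and a token
# count. Distribution and sizes are illustrative, not calibrated.
sites = [{"quality": random.random(), "tokens": random.randint(1_000, 10_000)}
         for _ in range(10_000)]

def token_weighted_quality(corpus):
    total = sum(s["tokens"] for s in corpus)
    return sum(s["quality"] * s["tokens"] for s in corpus) / total

def high_factual_share(corpus, cutoff=0.75):
    total = sum(s["tokens"] for s in corpus)
    return sum(s["tokens"] for s in corpus if s["quality"] >= cutoff) / total

withdraw = 0.30  # both regimes remove the same 30% of sites

# Random blocking: volume leaves, composition is roughly unchanged.
kept_random = random.sample(sites, int(len(sites) * (1 - withdraw)))

# Selective blocking: the highest-quality 30% of sites exit first.
ranked = sorted(sites, key=lambda s: s["quality"])
kept_selective = ranked[: int(len(sites) * (1 - withdraw))]

print(f"baseline quality:          {token_weighted_quality(sites):.3f}")
print(f"after random blocking:     {token_weighted_quality(kept_random):.3f}")
print(f"after selective blocking:  {token_weighted_quality(kept_selective):.3f}")
print(f"high-factual share left:   {high_factual_share(kept_selective):.3f}")
```

With equal volume withdrawn, the random regime leaves quality essentially unchanged, while the selective regime hollows out the top of the distribution, which is the paper's core point.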
Implications

Binary opt-out creates a lemons equilibrium by design.

The paper points to governance mechanisms that preserve the value of AI search while compensating content producers for training use.

Licensing markets

Replacing binary opt-out with compensation can recover most of the simulated high-factual content loss.

Collective bargaining

Shared licensing infrastructure can reduce transaction costs between AI firms and publishers.

Crawler differentiation

Publishers distinguish training crawlers from search crawlers, so governance should preserve that distinction.

Data composition

Training data quality is endogenous to policy. The data commons changes when incentives change.

The quality of AI-mediated information depends not only on model design, but on whether the best sources have a reason to remain in the commons.