Adverse Selection in the AI Data Commons
Generative AI depends on high-quality web content, but the current opt-out regime gives producers a binary choice. The paper shows that the highest-quality producers are the first to leave.
AI training data is a market without a market.
The paper frames accessible web content as an AI data commons: a shared input that improves models, search, summaries, recommendations, and downstream consumer information.
No systematic compensation
Licensing deals exist for some publishers, but most content producers receive no direct payment when their work contributes to training corpora.
Binary governance
The practical tool is `robots.txt`: permit AI crawlers or block them. That simple choice replaces prices, contracts, and quality-sensitive bargaining.
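In practice the binary choice is expressed in a few lines of `robots.txt`. A minimal sketch of a site-wide block of an AI training crawler (the `GPTBot` token is OpenAI's published crawler name; the rest is standard Robots Exclusion Protocol syntax):

```
# Opt the entire site out of crawling by OpenAI's training crawler
User-agent: GPTBot
Disallow: /
```

There is no field for price, scope, or conditions: the protocol offers allow or disallow, which is exactly the binary governance the paper critiques.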
Composition matters
The central question is not just how much content is blocked. It is which content exits and what remains available for AI systems.
Why high-quality producers exit first.
In the model, quality raises both visibility benefits and appropriation costs. When uncompensated training makes appropriation costs grow faster than visibility benefits as quality rises, a threshold emerges: producers above it block.
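The threshold logic can be sketched numerically. The functional forms below are illustrative assumptions, not the paper's specification: visibility benefits scale linearly in quality while appropriation costs scale convexly, so costs overtake benefits above a quality cutoff.

```python
def net_incentive(q, visibility_rate=1.0, appropriation_rate=0.6, training_scale=1.0):
    """Net payoff from staying accessible at quality q (assumed linear/convex forms)."""
    visibility_benefit = visibility_rate * q                            # traffic, reach, citations
    appropriation_cost = training_scale * appropriation_rate * q ** 2   # grows faster in q
    return visibility_benefit - appropriation_cost

def blocks(q, **kw):
    """A producer opts out when remaining accessible has negative net payoff."""
    return net_incentive(q, **kw) < 0

# Under these assumed forms the cutoff is q* = visibility_rate / (training_scale * appropriation_rate),
# so every producer above q* blocks: the highest-quality content exits first.
for q in [0.5, 1.0, 2.0, 3.0]:
    print(q, "blocks" if blocks(q) else "stays accessible")
```

Raising `training_scale` (more uncompensated training) lowers the cutoff, pulling progressively more of the high-quality tail out of the commons.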
[Figure: net incentive to remain accessible, with the blocking threshold marked]
Six findings, one pattern: the commons tilts away from credible content.
The study combines live `robots.txt` scraping with Media Bias/Fact Check ratings and a 45-month HTTP Archive panel from June 2022 to February 2026.
Quality gradient
[Figure: blocking rates by site quality]
The gradient appears when AI-specific opt-out becomes salient.
Event studies trace the divergence to the August 2023 GPTBot announcement. Before that moment, blocking was uniformly low and the quality gradient was small.
June 2022
Panel begins. AI-specific blocking is rare, with less than 10% of sites blocking any AI crawler.
August 2023 - GPTBot announcement
OpenAI makes AI-specific opt-out visible and actionable. High-factual sites move first and fastest.
December 2023 - NYT v. OpenAI
Copyright concerns become more salient across the media sector.
July 2024 - one-click blocking
Cloudflare lowers the operational cost of opting out of AI crawling.
February 2026
The cross-section covers 9,611 sites; the panel covers 7,002 sites and 272,458 site-month observations.
Post-period amplification
[Figure: mean coefficient ratio, pre- vs post-period]
The problem is selective exit, not blocking alone.
The paper's calibrated simulation compares blocking regimes while holding total withdrawal fixed or varying it. Random blocking removes volume; selective blocking removes quality.
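A toy version of that comparison makes the composition effect concrete. The quality distribution and withdrawal share below are assumptions for illustration, not the paper's calibration; both regimes remove the same number of sites, and only the composition of what remains differs.

```python
import random

random.seed(0)

# Assumed site-quality distribution (not the paper's): 10,000 sites, quality ~ N(0.5, 0.15)
qualities = sorted(random.gauss(0.5, 0.15) for _ in range(10_000))
withdraw_share = 0.2
n_blocked = int(withdraw_share * len(qualities))

# Random blocking: a uniform sample of sites exits.
random_remaining = random.sample(qualities, len(qualities) - n_blocked)

# Selective blocking: the highest-quality sites exit first (adverse selection).
selective_remaining = qualities[:-n_blocked]

def mean(xs):
    return sum(xs) / len(xs)

print(f"mean quality after random exit:    {mean(random_remaining):.3f}")
print(f"mean quality after selective exit: {mean(selective_remaining):.3f}")
```

At the same total volume of withdrawal, selective exit lowers the mean quality of the remaining commons, while random exit leaves it roughly unchanged.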
[Figure: decline relative to the no-blocking baseline under observed selective blocking]
Binary opt-out creates a lemons equilibrium by design.
The paper points to governance mechanisms that preserve the value of AI search while compensating content producers for training use.
Licensing markets
Replacing binary opt-out with compensation can recover most of the simulated high-factual content loss.
Collective bargaining
Shared licensing infrastructure can reduce transaction costs between AI firms and publishers.
Crawler differentiation
Publishers distinguish training crawlers from search crawlers, so governance should preserve that distinction.
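The distinction is already expressible in `robots.txt`, since rules are scoped per user agent. A minimal sketch (crawler tokens are the published names; the pairing of "search allowed, training blocked" is the pattern the paper observes, not a recommendation of these specific rules):

```
# Stay visible in search...
User-agent: Googlebot
Allow: /

# ...while opting out of AI training crawls
User-agent: GPTBot
Disallow: /
```

Governance that collapses training and search access into one switch would erase exactly the margin publishers are already using.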
Data composition
Training data quality is endogenous to policy. The data commons changes when incentives change.
The quality of AI-mediated information depends not only on model design, but on whether the best sources have a reason to remain in the commons.