Kai Zhu · Bocconi University · Interactive Explorer

Adverse Selection in the AI Data Commons

As AI companies harvest web content without compensation, the highest-quality publishers are opting out first. The result: a textbook lemons problem that systematically degrades AI training data.

9,611: media sites studied
6x: quality-blocking gap
47%: high-factual block rate
-24.3%: quality degradation
The Core Problem

A Market Failure in AI Training Data

Generative AI derives its power from high-quality web content, but no systematic market compensates the producers. The result is a classic tragedy of the commons.

🚧

The Lemons Problem

Content producers face a binary choice: permit AI crawlers or block them via robots.txt. No compensation, no negotiation, no middle ground.

  1. High-quality producers exit first — they have the most to lose from uncompensated content extraction
  2. Low-quality sources stay — misinformation, propaganda, and pseudoscience have little incentive to block
  3. Composition degrades — not how many leave, but which ones leave, determines quality
🔍

Our Approach

We study 9,611 media and news sites with expert quality ratings, tracking their AI-blocking decisions over 45 months.

  1. Cross-section (Feb 2026): robots.txt blocking status for all sites against 23 AI crawlers
  2. Panel (Jun 2022 – Feb 2026): 7,002 sites tracked monthly = 272,458 site-month observations
  3. Natural experiment: the GPTBot announcement (Aug 2023) as an exogenous salience shock
  4. Four complementary quality measures: editorial + web-authority, content-based + domain-based
⚠️
Key insight: robots.txt opt-out replaces price signals with binary choice, producing textbook adverse selection (Akerlof, 1970)
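The cross-sectional blocking measure can be reproduced from a site's robots.txt with the Python standard library. A minimal sketch — the robots.txt content and the site URL here are hypothetical, and a site is coded as "blocking" a crawler when that crawler may not fetch the site root:

```python
from urllib import robotparser

# Hypothetical robots.txt for a publisher that blocks training
# crawlers but allows an AI search crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Blocking status per crawler: blocked if the bot cannot fetch the root.
for bot in ["GPTBot", "Google-Extended", "OAI-SearchBot"]:
    blocked = not rp.can_fetch(bot, "https://example.com/")
    print(bot, "blocked" if blocked else "allowed")
```

In practice this check would be repeated for each of the 23 AI crawler user-agents against each site's live robots.txt.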
Institutional Background

The robots.txt Protocol & AI Crawlers

Since 1994, robots.txt lets websites specify which bots may access their content. With the rise of AI, publishers now face a critical distinction.

Training Crawlers

GPTBot, Google-Extended, ClaudeBot

Permanently incorporate content into model weights. This is value extraction — content becomes part of the AI with no attribution or compensation.

Search Crawlers

OAI-SearchBot, PerplexityBot

Index content for AI-powered search results. This is value creation — can drive traffic back to the original source.
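The training/search distinction plays out directly in robots.txt. A hypothetical publisher policy that blocks training crawlers while admitting an AI search crawler might look like:

```text
# robots.txt — hypothetical publisher policy
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
```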

Key Events Timeline

August 2023 — GPTBot Launched (Salience Shock)

OpenAI's announcement made AI-specific blocking salient and actionable for the first time. Before this, <10% of sites blocked any AI crawler.

December 2023 — NYT v. OpenAI Lawsuit

New York Times sued OpenAI for copyright infringement, escalating publisher awareness of content appropriation.

July 2024 — Cloudflare One-Click AI Block

Cloudflare launched a one-click tool to block all AI crawlers, dramatically lowering the technical barrier to opt-out.

Result 1

The Quality-Blocking Gradient

Blocking rises monotonically with content quality — the textbook signature of adverse selection.

• 47%: blocking rate among high-factual outlets
• 8%: blocking rate among low-factual outlets
• 6x: quality-blocking gap
• +17.6pp: per SD of content quality (p < 0.01)
[Figure: fig_quality_gradient]
Figure 1: AI-crawler blocking rates by MBFC factual reporting level (left) and credibility rating (right). The gradient is steep and monotonic: 8% (low-factual) to 47% (high-factual). Consistent across all four quality measures.
Robustness: The gradient persists within all site-size terciles (ruling out pure resources explanation), in logit models, balanced panels, and 1,000-permutation placebo tests (p < 0.001).
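The +17.6pp-per-SD gradient corresponds to the slope of a linear probability model of blocking on standardized quality. A minimal sketch on simulated data — the data-generating process is assumed for illustration, calibrated only to the reported coefficient:

```python
import random

random.seed(0)
n = 9_611  # sites, matching the paper's cross-section

# Assumed DGP: blocking probability rises by ~17.6pp per SD of quality.
quality = [random.gauss(0, 1) for _ in range(n)]
block = [1.0 if random.random() < min(max(0.27 + 0.176 * q, 0.0), 1.0) else 0.0
         for q in quality]

# Linear probability model: slope = cov(quality, block) / var(quality)
mq = sum(quality) / n
mb = sum(block) / n
cov = sum((q - mq) * (b - mb) for q, b in zip(quality, block)) / (n - 1)
var = sum((q - mq) ** 2 for q in quality) / (n - 1)
slope = cov / var
print(f"estimated gradient: {slope:+.3f} per SD of quality")
```

The recovered slope sits near the simulated +0.176, slightly attenuated because probabilities are clipped to [0, 1].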
Result 2

The Misinformation Asymmetry

Credible outlets opt out at nearly 6 times the rate of questionable sources. Misinformation stays freely available for AI training.

Credible vs Questionable
• 34.8%: credible outlets block
• 6.1%: questionable outlets block
• 5.7x: misinformation asymmetry
[Figure: fig_misinfo_flags]
By flag type: Conspiracy, propaganda, pseudoscience, and fake news sources all block at <12% — far below the 35% credible average.
[Figure: fig_misinfo_asymmetry]
The asymmetry is a downstream consequence of quality-dependent blocking. Conditional on content quality, questionable-source status has no independent effect — misinformation stays because it's low quality, and low quality doesn't block.
Implication: AI systems trained on this degraded corpus produce systematically less factual outputs — not because they choose misinformation, but because credible alternatives withdrew.
Result 3

Ideological Sorting

Centrist outlets block most; political extremes block least. AI training data is becoming ideologically skewed.

Blocking rates by political orientation: Left 24% · Left-Center 39% · Center 51% · Right-Center 49% · Right 11%
[Figure: fig_political]
Inverted-U pattern: The center blocks most, both poles block least. After controlling for quality, Left is -23pp and Right is -19pp below Center (both p<0.01). Political orientation has independent effects beyond quality.
AI training data is being ideologically sorted: Centrist, fact-checked journalism exits while partisan outlets on both sides remain freely accessible. This creates a systematic bias in the information AI systems can learn from.
Result 4

When Did Adverse Selection Emerge?

The quality-blocking gradient barely existed before GPTBot. The August 2023 announcement was the catalyst.

[Figure: fig_event_study]
Event study: Quality × monthly dummies relative to GPTBot announcement. Pre-period: small, flat gradient. Post-August 2023: sharp, persistent amplification. Post-period coefficients are 5–6x larger than pre-period across all four quality measures. Bootstrap 95% CIs exclude unity.
• 5–6x: post-period coefficients vs pre-period
• <10%: sites blocking any AI bot before Aug 2023
• >50%: high-factual blocking within 12 months
Result 5

Strategic Targeting: Training vs Search

High-quality publishers don't reject AI indiscriminately. They selectively block value extraction while permitting value creation.

• 45%: high-factual outlets blocking training bots
• 32%: high-factual outlets blocking search bots
• 38%: training coefficients exceed search coefficients
[Figure: fig_train_search_coefs]
Regression coefficients: Training coefficients ~38% larger across all four quality measures. Consistent in cross-section, panel, and site+time FE.
Not indiscriminate technology rejection: Publishers block value extraction (training) while maintaining AI search visibility (value creation). This sophisticated strategic behavior is consistent with theory: training appropriation costs exceed search costs, and the gap grows with quality.
Counterfactual Analysis

Decomposing the Quality Decline

Is quality degradation caused by the rate of blocking or the pattern? Simulation with 10,000 sites and 1,000 Monte Carlo replications reveals the answer.

[Figure: fig_sim_quality]
Quality distribution under random vs observed blocking at the same overall rate. Both remove ~41% of tokens. Only observed (quality-dependent) blocking shifts the distribution leftward. The entire quality decline comes from adverse selection — who blocks, not how many.
• -24.3%: high-factual content share decline under observed blocking
• 0%: decline under random blocking at the same rate
• -7.0%: decline with a licensing deal (halving top-tier blocking)
• 11x: token volume of high-factual vs low-factual sites
Double penalty: Adverse selection imposes both a quantity penalty (shared by any blocking regime, ~41% token loss) and a compositional penalty (unique to quality-dependent blocking). The compositional penalty accounts for the entire quality degradation.
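The decomposition logic can be sketched as a small Monte Carlo: hold the overall blocking rate fixed and compare quality-dependent with random blocking. The logistic blocking rule and the volume-quality link below are illustrative assumptions, not the paper's calibration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000    # sites per replication (as in the paper's simulation)
REPS = 200    # Monte Carlo replications (paper uses 1,000)

# Assumed: quality ~ N(0,1); token volume and blocking probability
# both rise with quality.
quality = rng.normal(size=(REPS, N))
tokens = np.exp(0.5 * quality)                 # assumed volume-quality link
p_block = 1 / (1 + np.exp(-(quality - 1)))     # quality-dependent blocking

obs_block = rng.random((REPS, N)) < p_block    # observed (adverse selection)
rate = obs_block.mean()                        # overall blocking rate
rand_block = rng.random((REPS, N)) < rate      # random blocking, same rate

def high_share(block):
    """Share of remaining tokens from top-quartile-quality sites."""
    keep = tokens * ~block
    high = quality > np.quantile(quality, 0.75)
    return (keep * high).sum() / keep.sum()

print("high-quality token share, observed blocking:", round(high_share(obs_block), 3))
print("high-quality token share, random blocking:  ", round(high_share(rand_block), 3))
```

Both regimes remove the same token mass in expectation, yet only the quality-dependent regime shifts the surviving composition away from high-quality sources — the compositional penalty the counterfactual isolates.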
Implications & Future Directions

Designing Markets for AI Training Data

Opt-out regimes produce lemons equilibria by design. Functioning compensation markets are needed to reverse quality degradation.

💰

Licensing Frameworks

Replace binary opt-out with a market for data licenses. Our simulation shows licensing (halving top-tier blocking) recovers most quality loss: -7.0% vs -24.3%. Compensation realigns incentives.

🏛️

Collective Action

Lower transaction costs via industry-wide standards — like ASCAP for music. Centralized collective licensing reduces bilateral negotiation costs that block individual deals.

⚖️

Regulatory Design

The EU AI Act and copyright reform must account for adverse selection: opt-out regimes produce lemons equilibria by design. Statutory licensing may be welfare-superior to pure opt-out.

🔍

Preserve the Training-Search Distinction

Publishers strategically differentiate. Licensing frameworks should honor this: separate compensation for training (extraction) and search (creation).

📊

Data Composition Is Endogenous

Training data quality is not a fixed input. It responds to governance. The commons degrades by design when producers' exit decisions shape what remains.

📰

New Misinformation Channel

AI amplifies misinformation not by generating it, but through systematic withdrawal of high-quality sources. As credible outlets exit, the remaining corpus tilts toward unreliable content.

Key Takeaways
Textbook adverse selection: High-quality media blocks AI crawlers at 6x the rate of low-quality sources. The gradient is steep, monotonic, and robust.
Misinformation stays open: Conspiracy, propaganda, pseudoscience block at <12% while credible journalism blocks at 35%. AI training data inherits a distorted information environment.
Centrist voices exit: Inverted-U pattern — the center blocks at 51%, both political poles below 25%. AI training data is becoming ideologically sorted.
GPTBot was the catalyst: Post-announcement coefficients 5–6x larger. Social learning and cascading adoption widened the gradient over time.
Strategic, not indiscriminate: Publishers block training (value extraction) at 38% higher rates than search (value creation).
Selection, not volume: The entire 24.3% quality decline comes from who blocks, not how many. Random blocking at the same rate produces zero quality degradation.

"As the best sources withdraw, AI training data becomes less factual, less credible, and more ideologically skewed. Without institutional solutions — functioning compensation markets, collective licensing, regulatory reform — the degradation becomes self-reinforcing."