Two questions get tangled in every discussion about AI control files. One: how do I stop AI companies from training on my content? Two: how do I help AI assistants find and quote the right pages? Different problems. Different files. Conflating them is why most sites ship the wrong thing.

The two jobs, kept separate

Opt-out from training is a permission signal. It tells a crawler: do not ingest this for model weights. The audience is the company running the crawler. The action you want is restraint.

Discovery for AI assistants is the opposite. It tells a chat model or agent: these are the pages that represent us. Read them first. The audience is the assistant at inference time. The action you want is engagement.

Most posts on the topic muddle these into a generic "AI file." They are not the same job. The proposals below each pick one.

The six proposals on the table

| Proposal | File / mechanism | Job | Who reads it | Status (April 2026) |
|---|---|---|---|---|
| robots.txt with AI tokens | /robots.txt + User-agent lines (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot, OAI-SearchBot, PerplexityBot) | Opt-out from training and/or AI search | Major AI companies' crawlers | De facto standard. Honored by OpenAI, Anthropic, Google, Apple, Perplexity. |
| llms.txt | /llms.txt (Markdown) | Help AI assistants find canonical docs | Inference-time agents and LLM clients | Community proposal (Howard, Sept 2024). 844k+ sites per BuiltWith Oct 2025. No major AI vendor confirmed it as an inference input. |
| ai.txt (Spawning) | /ai.txt | Opt-out from generative AI training on media | Spawning's API + partners (Hugging Face, Stability AI) | Active inside Spawning's network. Limited reach outside it. |
| IETF AIPREF | Vocabulary draft + HTTP attachment draft | Standardize machine-readable AI usage preferences | Future-compliant crawlers | Working group chartered March 2025. Original August 2025 ship date missed. Vocab finalization targeted for August 2026. |
| TDM Reservation Protocol | tdm-reservation + tdm-policy metadata | Article 4 EU CDSM opt-out for text and data mining | EU-bound TDM and AI training operators | W3C Community Group final report. Cited in EU AI Act Article 53. |
| C2PA Content Credentials | Signed manifest embedded in media files | Provenance, not opt-out | Verifiers, platforms, AI labelers | C2PA Spec v2.2. 6,000+ members. Recommended in EU AI Act Article 50 transparency rules from Aug 2026. |

Six rows. Two distinct jobs. Different layers of the stack.

What each one looks like

robots.txt with AI user-agents is the workhorse. Short, supported, boring:

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow AI search and grounding
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
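A policy like the one above is easy to sanity-check before deploying. This sketch uses Python's standard-library robots.txt parser against an inlined policy (the body and paths here are illustrative, not a fetch of any real site):

```python
from urllib.robotparser import RobotFileParser

# A trimmed version of the policy above, inlined for testing.
ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Training crawler is blocked everywhere.
print(rp.can_fetch("GPTBot", "/any/page"))         # → False
# AI search crawler is explicitly allowed.
print(rp.can_fetch("OAI-SearchBot", "/any/page"))  # → True
# With no User-agent: * group, unmatched bots default to allowed.
print(rp.can_fetch("UnknownBot", "/any/page"))     # → True
```

The last line is the gotcha: robots.txt is deny-by-enumeration, so any crawler you did not name inherits the default, which is allow unless you add a `User-agent: *` group.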

Spawning's /ai.txt borrows robots-style syntax but targets media downloads rather than crawl events:

# Spawning AI - ai.txt v1.0
User-Agent: *
Disallow: /

llms.txt is curated Markdown that points at canonical pages instead of forbidding them:

# isitready.dev

> Public website scanning for AI readiness, SEO, security,
> performance, and production quality.

## Docs
- [Methodology](https://isitready.dev/methodology)
- [FAQ](https://isitready.dev/faq)

C2PA is not a root-level file at all. It's a cryptographically signed manifest embedded in JPEG, MP4, PDF, or other binary assets, built on X.509 (RFC 5280), CBOR (RFC 8949), and JUMBF (ISO 19566-5). Samsung's Galaxy S25 signs photos with it natively. Microsoft Bing labels AI images with it. Different layer entirely.

TDMRep adds two HTTP headers or HTML meta tags: tdm-reservation: 1 reserves rights, and tdm-policy: <url> points at an ODRL document with terms. Article 53 of the EU AI Act obliges general-purpose model providers to identify and comply with machine-readable rights reservations expressed this way, regardless of where training happens.
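In HTML, the two signals look like this — a minimal sketch, with a placeholder policy URL (the equivalent HTTP response headers are `tdm-reservation: 1` and `tdm-policy: <policy URL>`):

```html
<!-- Reserve TDM rights for this page -->
<meta name="tdm-reservation" content="1">
<!-- Point at an ODRL policy document stating the terms (placeholder URL) -->
<meta name="tdm-policy" content="https://example.com/tdm-policy.json">
```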

What the IETF group is actually building

AIPREF is the only effort with a chance of becoming a real standard. The working group met first in March 2025. Its two drafts — draft-ietf-aipref-vocab-05 and draft-ietf-aipref-attach — define a vocabulary of purpose-based preferences ("do not use this content for X") and ways to attach those preferences to HTTP responses, including a robots.txt extension under RFC 9309.

The August 2025 publication target slipped. Notes from the IETF 125 Toronto meeting in April 2026 put vocabulary finalization in the "slight chance by August 2026" bucket. W3C TDMRep is waiting on the AIPREF vocabulary before integrating it into TDM Policy. So the eventual unified opt-out lives somewhere downstream of AIPREF + TDMRep + robots.txt — not in any single file shipping today.

What to ship right now

Three moves cover most production sites:

  1. Ship robots.txt with explicit AI User-agent blocks. Decide whether you want to be in training corpora (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot) and whether you want to appear in AI search and answers (OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-SearchBot). These are the only signals the major vendors have publicly committed to honoring. Around 21% of the top 1,000 sites already have GPTBot rules; ClaudeBot is blocked by roughly 69% of sites that bother to set policy.

  2. Ship llms.txt if you want AI assistants citing the right pages. It does a different job. Docs sites, developer tools, and SaaS products win the most. Anthropic publishes one at docs.anthropic.com/llms.txt plus an llms-full.txt clocking in at 481,349 tokens. Cloudflare and Stripe ship them too. Whether AI vendors actually fetch these at inference is unconfirmed; the cost to publish is near zero.

  3. Skip Spawning's /ai.txt unless you use Spawning's tools. The signal flows through Spawning's API to partners like Hugging Face and Stability AI. Outside that network, the file is mostly decorative. The same opt-out intent is better expressed via robots.txt User-agent blocks.

EU publishers and rightsholders carrying media assets should also implement TDMRep. The EU AI Act gives that signal legal teeth that the others lack. C2PA matters for anyone publishing photos or video — provenance is its own problem domain, not an opt-out lever.

Watch AIPREF. When the vocabulary lands and major crawlers commit, that becomes the file to ship. Until then, robots.txt is the only opt-out signal with confirmed honoring across the major vendors.

Verify before shipping

Run isitready.dev against your origin. The scan reads /robots.txt for AI user-agent coverage, validates /llms.txt and /llms-full.txt, checks alignment between the URLs in llms.txt and your sitemap, and flags missing or stale tokens against the current crawler list. Fix what it flags. Re-run. Ship.