An audit fails on a Tuesday afternoon. Your robots.txt blocks GPTBot but allows OAI-SearchBot, and you cannot remember which one matters for ChatGPT Search citations. A second tab is open to Anthropic's docs because someone on the team mentioned anthropic-ai is deprecated. You have nine other crawlers to look up. Each operator publishes its own page in its own format with its own naming convention. The spreadsheet you keep starts to look like a guess.

Stop guessing.

This page is the spreadsheet, but maintained. Every documented AI crawler is in one table, with the exact user-agent token you write into robots.txt, the operator, the job that crawler actually does, whether the operator commits to robots.txt compliance, and a default policy recommendation that takes a position. Bookmark this. Re-check it before you ship.

The reference table

| User-agent token | Operator | Primary purpose | Respects robots.txt | Official docs | Default recommendation |
| --- | --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Training corpus for foundation models | Yes | platform.openai.com/docs/bots | Allow if you want training inclusion; block if not. Independent of search visibility. |
| OAI-SearchBot | OpenAI | ChatGPT Search index and answer surfaces | Yes | platform.openai.com/docs/bots | Allow. Blocking removes you from ChatGPT Search citations. |
| ChatGPT-User | OpenAI | User-triggered fetch from ChatGPT and custom GPTs | Yes (per OpenAI) | platform.openai.com/docs/bots | Allow. Blocking breaks live answers your prospects request. |
| ClaudeBot | Anthropic | Public-web crawl that may contribute to training | Yes | docs.anthropic.com/en/docs/claude-code/bot-policy | Allow if you want training inclusion; block if not. |
| Claude-SearchBot | Anthropic | Search relevance for Claude answers | Yes | docs.anthropic.com/en/docs/claude-code/bot-policy | Allow. The retrieval path for Claude citations. |
| Claude-User | Anthropic | User-directed fetch for Claude tool use | Yes | docs.anthropic.com/en/docs/claude-code/bot-policy | Allow. |
| anthropic-ai | Anthropic (legacy) | Historical training token | Deprecated | docs.anthropic.com/en/docs/claude-code/bot-policy | Listed for completeness. Anthropic now uses ClaudeBot. Keep an old block rule if you have one; do not write new policy against it. |
| PerplexityBot | Perplexity | Search index for Perplexity answers | Yes | docs.perplexity.ai/guides/bots | Allow for citation visibility. Blocked pages still appear as headline plus brief summary. |
| Perplexity-User | Perplexity | User-triggered fetch during a Perplexity session | No (documented bypass) | docs.perplexity.ai/guides/bots | Allow. Blocking is mostly symbolic; Perplexity documents that this token can ignore robots.txt for user-initiated URLs. WAF if you actually need to stop it. |
| Google-Extended | Google | Opt-out token for Gemini, Bard, and Vertex AI generative use | Yes | developers.google.com/search/docs/crawling-indexing/overview-google-crawlers | Allow if you want generative grounding; block if you object to training. Has no separate IP fetch; it is a control token, not a crawler. |
| Googlebot | Google | Google Search index | Yes | developers.google.com/search/docs/crawling-indexing/googlebot | Allow. Blocking Google-Extended does not affect Search; do not conflate them. |
| Bytespider | ByteDance | Training and TikTok ranking signals | Partial; thin transparency | None official; see Search Engine Journal coverage of Bytespider | Block. Aggressive crawl patterns, weak public commitments. |
| CCBot | Common Crawl | Open dataset feeding many third-party AI training pipelines | Yes | commoncrawl.org/ccbot | Block if you object to AI training. Common Crawl is the upstream for dozens of model corpora. |
| Applebot | Apple | Spotlight, Siri, Safari Suggestions | Yes | support.apple.com/en-us/119829 | Allow. Drives Apple device discovery surfaces. |
| Applebot-Extended | Apple | Apple Intelligence training opt-out | Yes | support.apple.com/en-us/119829 | Block if you want Apple Search but not training. The split mirrors Google's. |
| Bingbot | Microsoft | Bing index, also feeds Copilot answers | Yes | bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0 | Allow. Bingbot is the indirect source of most Copilot grounding. |
| DuckAssistBot | DuckDuckGo | Retrieval for DuckAssist answers | Yes | duckduckgo.com/duckduckgo-help-pages/results/duckassistbot | Allow. On-demand fetch tied to user queries. |
| Amazonbot | Amazon | Powers Alexa and Rufus; may be used for AI training | Yes | developer.amazon.com/amazonbot | Allow if you sell anything Rufus might surface. Otherwise neutral. |
| Meta-ExternalAgent | Meta | Training data and Meta's independent search index | Yes | developers.facebook.com/docs/sharing/webmasters/web-crawlers | Block if you object to training. Meta documents both jobs against this single token, which is unhelpful. |
| Meta-ExternalFetcher | Meta | User-triggered fetch from Meta AI products | Partial; documented bypass for user URLs | developers.facebook.com/docs/sharing/webmasters/web-crawlers | Allow. Same logic as Perplexity-User. |
| cohere-ai | Cohere | Inferred to be product fetch; vendor-undocumented | Unknown | No vendor docs page | Block. No transparency, no documented purpose. |
| Diffbot | Diffbot | Knowledge graph crawl sold as a product | Yes (current); historical complaints | diffbot.com/dev/docs/crawl | Block unless you are a Diffbot customer. The graph is a commercial product. |
| Timpibot | Timpi | Decentralized index, used as LLM training input | Reported non-compliant in the wild | timpi.io | Block, and WAF as well. Reports of robots.txt being ignored go back to 2022. |
| YouBot | You.com | Crawl that powers You.com AI Search | Yes | docs.you.com/youbot | Allow. Citation surface for a small but real audience. |

What "purpose" actually means here

Three categories matter. The first is training: the crawler harvests pages to be folded into a model's parameters during the next pretraining or fine-tuning cycle. GPTBot, ClaudeBot, Bytespider, and CCBot are training crawlers. Block them and your content is less likely to end up in a future model's weights; allow them and it is more likely.

The second is search retrieval. These crawlers build a live index that an answer engine queries at inference time. OAI-SearchBot, Claude-SearchBot, PerplexityBot, and Bingbot are the main ones. This is the path to citations. Block these and your URL stops appearing as a source.

The third is user-triggered fetch. A human pasted a URL or asked a question that required this specific page. ChatGPT-User, Claude-User, Perplexity-User, Meta-ExternalFetcher, DuckAssistBot, and Amazon's Amzn-User sit here. Two operators (Perplexity and Meta) explicitly document that these tokens can ignore robots.txt when a user names the URL. Robots.txt blocks on this category are advisory at best.

The default policy most sites should adopt

Allow every retrieval and user-triggered crawler. Decide training case by case based on whether your content is your moat. If it is (original research, paid courseware, premium reporting), block training tokens (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Bytespider, CCBot, Meta-ExternalAgent). If you publish a marketing site, allow training too; the upside of being in the model is real and the downside is small.
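
If your content is your moat, that policy translates into robots.txt groups along these lines. A minimal sketch, not a complete file: the tokens come from the table above, and the marketing-site variant simply flips the training stanza to Allow.

```
# Retrieval and user-triggered fetch: allow (this is the citation path)
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
User-agent: Claude-User
Allow: /

# Training: block only if your content is your moat
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: CCBot
User-agent: Meta-ExternalAgent
Disallow: /

# No transparency commitments: always block
User-agent: Bytespider
User-agent: Timpibot
User-agent: cohere-ai
Disallow: /
```

The explicit Allow group is not decorative. Under the robots exclusion protocol, a crawler that matches a named group ignores the * group entirely, so naming the retrieval bots shields them from any blanket Disallow you add later.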

Always block crawlers without transparency commitments (Bytespider, Timpibot, cohere-ai); the cost is zero. If a crawler ignores robots.txt in the wild, escalate to WAF rules. User-agent strings are trivial to spoof, so verify by reverse DNS or against the IP ranges operators publish. Perplexity, Amazon, OpenAI, and DuckDuckGo all publish JSON IP lists at documented endpoints.
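
A minimal sketch of that reverse-then-forward DNS check in Python, assuming you pull the claimed agent and the IP from your access logs; the suffix map here is illustrative and should be filled in from each operator's own documentation.

```python
import socket

# Illustrative rDNS suffixes; confirm against each operator's published docs.
EXPECTED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
    "Applebot": (".applebot.apple.com",),
}

def verify_crawler(ip: str, claimed_agent: str) -> bool:
    """Reverse-then-forward DNS check. Resolve the IP to a hostname,
    confirm the hostname ends in an operator-owned suffix, then resolve
    the hostname forward and confirm it maps back to the original IP."""
    suffixes = EXPECTED_SUFFIXES.get(claimed_agent)
    if not suffixes:
        return False  # no known suffix for this agent: treat as unverified
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not hostname.endswith(suffixes):
            return False
        # Forward confirmation: spoofed rDNS fails this second lookup.
        _, _, forward_ips = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    return ip in forward_ips
```

For operators that publish JSON IP lists rather than stable reverse DNS, the same function can instead check the connecting IP against the downloaded ranges.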

One trap to avoid: do not collapse Google-Extended and Googlebot into a single rule. They are independent. Same for Applebot and Applebot-Extended. Blocking the parent crawler kills your Search visibility; blocking the -Extended token only opts you out of generative use.
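
Spelled out, the safe pattern is two independent groups per vendor; a sketch of the shape, not your whole file:

```
# Keep Search and Spotlight visibility
User-agent: Googlebot
User-agent: Applebot
Allow: /

# Opt out of generative and training use only
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /
```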

Verify before shipping

Robots.txt is text in a file until something reads it. isitready.dev parses your live robots.txt, cross-references every token in this table, and flags the contradictions: a blanket Disallow: / that catches the OAI-SearchBot you meant to allow, an Applebot block written when the author meant Applebot-Extended, a missing rule for Perplexity-User that leaves the WAF to answer instead. Run the audit before the next deploy.
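
If you want the same check in CI rather than only in a web tool, a rough equivalent with Python's standard library follows. The site URL is a placeholder and the two token lists are drawn from the table above; adjust them to the policy you actually chose.

```python
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"  # placeholder: your own origin

# Tokens your policy says must be allowed or blocked (from the table above).
MUST_ALLOW = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Bingbot"]
MUST_BLOCK = ["Bytespider", "Timpibot", "cohere-ai"]

rp = RobotFileParser(f"{SITE}/robots.txt")
rp.read()  # fetches and parses the live file

for token in MUST_ALLOW:
    if not rp.can_fetch(token, f"{SITE}/"):
        print(f"CONTRADICTION: {token} is blocked but the policy says allow")

for token in MUST_BLOCK:
    if rp.can_fetch(token, f"{SITE}/"):
        print(f"GAP: {token} is allowed but the policy says block")
```

Note that robotparser only evaluates the rules as written; whether a token like Perplexity-User will honor them is the table's job to tell you.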