An audit fails on a Tuesday afternoon. Your robots.txt blocks GPTBot but allows OAI-SearchBot, and you cannot remember which one matters for ChatGPT Search citations. A second tab is open to Anthropic's docs because someone on the team mentioned anthropic-ai is deprecated. You have nine other crawlers to look up. Each operator publishes its own page in its own format with its own naming convention. The spreadsheet you keep starts to look like a guess.

Stop guessing.

This page is the spreadsheet, but maintained. Every documented AI crawler is in one table, with the exact user-agent token you write into robots.txt, the operator, the job that crawler actually does, whether the operator commits to robots.txt compliance, and a default policy recommendation that takes a position. Bookmark this. Re-check it before you ship.

The reference table

| User-agent token | Operator | Primary purpose | Respects robots.txt | Official docs | Default recommendation |
| --- | --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Training corpus for foundation models | Yes | platform.openai.com/docs/bots | Allow if you want training inclusion; block if not. Independent of search visibility. |
| OAI-SearchBot | OpenAI | ChatGPT Search index and answer surfaces | Yes | platform.openai.com/docs/bots | Allow. Blocking removes you from ChatGPT Search citations. |
| ChatGPT-User | OpenAI | User-triggered fetch from ChatGPT and custom GPTs | Yes (per OpenAI) | platform.openai.com/docs/bots | Allow. Blocking breaks live answers your prospects request. |
| ClaudeBot | Anthropic | Public-web crawl that may contribute to training | Yes | docs.anthropic.com/en/docs/claude-code/bot-policy | Allow if you want training inclusion; block if not. |
| Claude-SearchBot | Anthropic | Search relevance for Claude answers | Yes | docs.anthropic.com/en/docs/claude-code/bot-policy | Allow. The retrieval path for Claude citations. |
| Claude-User | Anthropic | User-directed fetch for Claude tool use | Yes | docs.anthropic.com/en/docs/claude-code/bot-policy | Allow. |
| anthropic-ai | Anthropic (legacy) | Historical training token | Deprecated | docs.anthropic.com/en/docs/claude-code/bot-policy | Listed for completeness. Anthropic now uses ClaudeBot. Keep an old block rule if you have one; do not write new policy against it. |
| PerplexityBot | Perplexity | Search index for Perplexity answers | Yes | docs.perplexity.ai/guides/bots | Allow for citation visibility. Blocked pages still appear as headline plus brief summary. |
| Perplexity-User | Perplexity | User-triggered fetch during a Perplexity session | No (documented bypass) | docs.perplexity.ai/guides/bots | Allow. Blocking is mostly symbolic; Perplexity documents that this token can ignore robots.txt for user-initiated URLs. WAF if you actually need to stop it. |
| Google-Extended | Google | Opt-out token for Gemini, Bard, and Vertex AI generative use | Yes | developers.google.com/search/docs/crawling-indexing/overview-google-crawlers | Allow if you want generative grounding; block if you object to training. Has no separate IP fetch; it is a control token, not a crawler. |
| Googlebot | Google | Google Search index | Yes | developers.google.com/search/docs/crawling-indexing/googlebot | Allow. Blocking Google-Extended does not affect Search; do not conflate them. |
| Bytespider | ByteDance | Training and TikTok ranking signals | Partial; thin transparency | None official; see Search Engine Journal coverage of Bytespider | Block. Aggressive crawl patterns, weak public commitments. |
| CCBot | Common Crawl | Open dataset feeding many third-party AI training pipelines | Yes | commoncrawl.org/ccbot | Block if you object to AI training. Common Crawl is the upstream for dozens of model corpora. |
| Applebot | Apple | Spotlight, Siri, Safari Suggestions | Yes | support.apple.com/en-us/119829 | Allow. Drives Apple device discovery surfaces. |
| Applebot-Extended | Apple | Apple Intelligence training opt-out | Yes | support.apple.com/en-us/119829 | Block if you want Apple Search but not training. The split mirrors Google's. |
| Bingbot | Microsoft | Bing index, also feeds Copilot answers | Yes | bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0 | Allow. Bingbot is the indirect source of most Copilot grounding. |
| DuckAssistBot | DuckDuckGo | Retrieval for DuckAssist answers | Yes | duckduckgo.com/duckduckgo-help-pages/results/duckassistbot | Allow. On-demand fetch tied to user queries. |
| Amazonbot | Amazon | Powers Alexa and Rufus; may be used for AI training | Yes | developer.amazon.com/amazonbot | Allow if you sell anything Rufus might surface. Otherwise neutral. |
| Meta-ExternalAgent | Meta | Training data and Meta's independent search index | Yes | developers.facebook.com/docs/sharing/webmasters/web-crawlers | Block if you object to training. Meta documents both jobs against this single token, which is unhelpful. |
| Meta-ExternalFetcher | Meta | User-triggered fetch from Meta AI products | Partial; documented bypass for user URLs | developers.facebook.com/docs/sharing/webmasters/web-crawlers | Allow. Same logic as Perplexity-User. |
| cohere-ai | Cohere | Inferred to be product fetch; vendor-undocumented | Unknown | No vendor docs page | Block. No transparency, no documented purpose. |
| Diffbot | Diffbot | Knowledge graph crawl sold as a product | Yes (current); historical complaints | diffbot.com/dev/docs/crawl | Block unless you are a Diffbot customer. The graph is a commercial product. |
| Timpibot | Timpi | Decentralized index, used as LLM training input | Reported non-compliant in the wild | timpi.io | Block, and WAF as well. Reports of robots.txt being ignored go back to 2022. |
| YouBot | You.com | Crawl that powers You.com AI Search | Yes | docs.you.com/youbot | Allow. Citation surface for a small but real audience. |

What "purpose" actually means here

Three categories matter. The first is training: the crawler harvests pages to be folded into a model's parameters during the next pretraining or fine-tuning cycle. GPTBot, ClaudeBot, Bytespider, and CCBot are training crawlers. Block them and your content is less likely to end up in a future model's weights; allow them and it is more likely.

The second is search retrieval. These crawlers build a live index that an answer engine queries at inference time. OAI-SearchBot, Claude-SearchBot, PerplexityBot, and Bingbot are the main ones. This is the path to citations. Block these and your URL stops appearing as a source.

The third is user-triggered fetch. A human pasted a URL or asked a question that required this specific page. ChatGPT-User, Claude-User, Perplexity-User, Meta-ExternalFetcher, DuckAssistBot, and Amazon's Amzn-User sit here. Two operators (Perplexity and Meta) explicitly document that these tokens can ignore robots.txt when a user names the URL. Robots.txt blocks on this category are advisory at best.

The default policy most sites should adopt

Allow every retrieval and user-triggered crawler. Decide training case by case based on whether your content is your moat. If it is (original research, paid courseware, premium reporting), block training tokens (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Bytespider, CCBot, Meta-ExternalAgent). If you publish a marketing site, allow training too; the upside of being in the model is real and the downside is small.
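
If your content is your moat, that policy translates into robots.txt groups along these lines. A minimal sketch, not a complete file: the tokens come from the table above, and the marketing-site variant simply flips the training stanza to Allow.

```
# Retrieval and user-triggered fetch: allow (this is the citation path)
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
User-agent: Claude-User
Allow: /

# Training: block only if your content is your moat
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: CCBot
User-agent: Meta-ExternalAgent
Disallow: /

# No transparency commitments: always block
User-agent: Bytespider
User-agent: Timpibot
User-agent: cohere-ai
Disallow: /
```

The explicit Allow group is not decorative. Under the robots exclusion protocol, a crawler that matches a named group ignores the * group entirely, so naming the retrieval bots shields them from any blanket Disallow you add later.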

Always block crawlers without transparency commitments (Bytespider, Timpibot, cohere-ai); the cost is zero. If a crawler ignores robots.txt in the wild, escalate to WAF rules. User-agent strings are trivial to spoof, so verify by reverse DNS or against the IP ranges operators publish. Perplexity, Amazon, OpenAI, and DuckDuckGo all publish JSON IP lists at documented endpoints.
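
A minimal sketch of that reverse-then-forward DNS check in Python, assuming you pull the claimed agent and the IP from your access logs; the suffix map here is illustrative and should be filled in from each operator's own documentation.

```python
import socket

# Illustrative rDNS suffixes; confirm against each operator's published docs.
EXPECTED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
    "Applebot": (".applebot.apple.com",),
}

def verify_crawler(ip: str, claimed_agent: str) -> bool:
    """Reverse-then-forward DNS check. Resolve the IP to a hostname,
    confirm the hostname ends in an operator-owned suffix, then resolve
    the hostname forward and confirm it maps back to the original IP."""
    suffixes = EXPECTED_SUFFIXES.get(claimed_agent)
    if not suffixes:
        return False  # no known suffix for this agent: treat as unverified
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not hostname.endswith(suffixes):
            return False
        # Forward confirmation: spoofed rDNS fails this second lookup.
        _, _, forward_ips = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    return ip in forward_ips
```

For operators that publish JSON IP lists rather than stable reverse DNS, the same function can instead check the connecting IP against the downloaded ranges.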

One trap to avoid: do not collapse Google-Extended and Googlebot into a single rule. They are independent. Same for Applebot and Applebot-Extended. Blocking the parent crawler kills your Search visibility; blocking the -Extended token only opts you out of generative use.
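
Spelled out, the safe pattern is two independent groups per vendor; a sketch of the shape, not your whole file:

```
# Keep Search and Spotlight visibility
User-agent: Googlebot
User-agent: Applebot
Allow: /

# Opt out of generative and training use only
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /
```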

Verify before shipping

Robots.txt is text in a file until something reads it. isitready.dev parses your live robots.txt, cross-references every token in this table, and flags the contradictions: a blanket Disallow: / that catches the OAI-SearchBot you meant to allow, an Applebot block written when the author meant Applebot-Extended, a missing rule for Perplexity-User that leaves the WAF to answer instead. Run the audit before the next deploy.
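
If you want the same check in CI rather than only in a web tool, a rough equivalent with Python's standard library follows. The site URL is a placeholder and the two token lists are drawn from the table above; adjust them to the policy you actually chose.

```python
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"  # placeholder: your own origin

# Tokens your policy says must be allowed or blocked (from the table above).
MUST_ALLOW = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Bingbot"]
MUST_BLOCK = ["Bytespider", "Timpibot", "cohere-ai"]

rp = RobotFileParser(f"{SITE}/robots.txt")
rp.read()  # fetches and parses the live file

for token in MUST_ALLOW:
    if not rp.can_fetch(token, f"{SITE}/"):
        print(f"CONTRADICTION: {token} is blocked but the policy says allow")

for token in MUST_BLOCK:
    if rp.can_fetch(token, f"{SITE}/"):
        print(f"GAP: {token} is allowed but the policy says block")
```

Note that robotparser only evaluates the rules as written; whether a token like Perplexity-User will honor them is the table's job to tell you.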