An audit fails on a Tuesday afternoon. Your robots.txt blocks GPTBot but allows OAI-SearchBot, and you cannot remember which one matters for ChatGPT Search citations. A second tab is open to Anthropic's docs because someone on the team mentioned anthropic-ai is deprecated. You have nine other crawlers to look up. Each operator publishes its own page in its own format with its own naming convention. The spreadsheet you keep starts to look like a guess.
Stop guessing.
This page is the spreadsheet, but maintained. Every documented AI crawler is in one table, with the exact user-agent token you write into robots.txt, the operator, the job that crawler actually does, whether the operator commits to robots.txt compliance, and a default policy recommendation that takes a position. Bookmark this. Re-check it before you ship.
The reference table
| User-agent token | Operator | Primary purpose | Respects robots.txt | Official docs | Default recommendation |
| --- | --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Training corpus for foundation models | Yes | platform.openai.com/docs/bots | Allow if you want training inclusion; block if not. Independent of search visibility. |
| OAI-SearchBot | OpenAI | ChatGPT Search index and answer surfaces | Yes | platform.openai.com/docs/bots | Allow. Blocking removes you from ChatGPT Search citations. |
| ChatGPT-User | OpenAI | User-triggered fetch from ChatGPT and custom GPTs | Yes (per OpenAI) | platform.openai.com/docs/bots | Allow. Blocking breaks live answers your prospects request. |
| ClaudeBot | Anthropic | Public-web crawl that may contribute to training | Yes | docs.anthropic.com/en/docs/claude-code/bot-policy | Allow if you want training inclusion; block if not. |
| Claude-SearchBot | Anthropic | Search relevance for Claude answers | Yes | docs.anthropic.com/en/docs/claude-code/bot-policy | Allow. The retrieval path for Claude citations. |
| Claude-User | Anthropic | User-directed fetch for Claude tool use | Yes | docs.anthropic.com/en/docs/claude-code/bot-policy | Allow. |
| anthropic-ai | Anthropic (legacy) | Historical training token | Deprecated | docs.anthropic.com/en/docs/claude-code/bot-policy | Listed for completeness. Anthropic now uses ClaudeBot. Keep an old block rule if you have one; do not write new policy against it. |
| PerplexityBot | Perplexity | Search index for Perplexity answers | Yes | docs.perplexity.ai/guides/bots | Allow for citation visibility. Blocked pages still appear as headline + brief summary. |
| Perplexity-User | Perplexity | User-triggered fetch during a Perplexity session | No (documented bypass) | docs.perplexity.ai/guides/bots | Allow. Blocking is mostly symbolic; Perplexity documents that this token can ignore robots.txt for user-initiated URLs. WAF if you actually need to stop it. |
| Google-Extended | Google | Opt-out token for Gemini (formerly Bard) and Vertex AI generative use | Yes | developers.google.com/search/docs/crawling-indexing/overview-google-crawlers | Allow if you want generative grounding; block if you object to training. Has no separate fetch infrastructure; it is a control token, not a crawler. |
| Googlebot | Google | Google Search index | Yes | developers.google.com/search/docs/crawling-indexing/googlebot | Allow. Blocking Google-Extended does not affect Search; do not conflate them. |
| Bytespider | ByteDance | Training and TikTok ranking signals | Partial; thin transparency | No vendor page (Search Engine Journal coverage) | Block. Aggressive crawl patterns, weak public commitments. |
| CCBot | Common Crawl | Open dataset feeding many third-party AI training pipelines | Yes | commoncrawl.org/ccbot | Block if you object to AI training. Common Crawl is the upstream for dozens of model corpora. |
| Applebot | Apple | Spotlight, Siri, Safari Suggestions | Yes | support.apple.com/en-us/119829 | Allow. Drives Apple device discovery surfaces. |
| Applebot-Extended | Apple | Apple Intelligence training opt-out | Yes | support.apple.com/en-us/119829 | Block if you want Apple Search but not training. The split mirrors Google's. |
| Bingbot | Microsoft | Bing index, also feeds Copilot answers | Yes | bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0 | Allow. Bingbot is the indirect source of most Copilot grounding. |
| DuckAssistBot | DuckDuckGo | Retrieval for DuckAssist answers | Yes | duckduckgo.com/duckduckgo-help-pages/results/duckassistbot | Allow. On-demand fetch tied to user queries. |
| Amazonbot | Amazon | Powers Alexa and Rufus; may be used for AI training | Yes | developer.amazon.com/amazonbot | Allow if you sell anything Rufus might surface. Otherwise neutral. |
| Meta-ExternalAgent | Meta | Training data and Meta's independent search index | Yes | developers.facebook.com/docs/sharing/webmasters/web-crawlers | Block if you object to training. Meta documents both jobs against this single token, which is unhelpful. |
| Meta-ExternalFetcher | Meta | User-triggered fetch from Meta AI products | Partial; documented bypass for user URLs | developers.facebook.com/docs/sharing/webmasters/web-crawlers | Allow. Same logic as Perplexity-User. |
| cohere-ai | Cohere | Inferred to be product fetch; vendor-undocumented | Unknown | No vendor docs page | Block. No transparency, no documented purpose. |
| Diffbot | Diffbot | Knowledge graph crawl sold as a product | Yes (current); historical complaints | diffbot.com/dev/docs/crawl | Block unless you are a Diffbot customer. The graph is a commercial product. |
| Timpibot | Timpi | Decentralized index, used as LLM training input | Reported non-compliant in the wild | timpi.io | Block. WAF as well. Reports of robots.txt being ignored go back to 2022. |
| YouBot | You.com | Crawl that powers You.com AI Search | Yes | docs.you.com/youbot | Allow. Citation surface for a small but real audience. |
What "purpose" actually means here
Three categories matter. The first is training: the crawler harvests pages to be folded into a model's parameters during the next pretraining or fine-tuning cycle. GPTBot, ClaudeBot, Bytespider, and CCBot are training crawlers. Block them and your content is less likely to be in a future model's weights. Allow them and the inverse holds.
The second is search retrieval. These crawlers build a live index that an answer engine queries at inference time. OAI-SearchBot, Claude-SearchBot, PerplexityBot, and Bingbot are the main ones. This is the path to citations. Block these and your URL stops appearing as a source.
The third is user-triggered fetch. A human pasted a URL or asked a question that required this specific page. ChatGPT-User, Claude-User, Perplexity-User, Meta-ExternalFetcher, DuckAssistBot, and Amazon's Amzn-User sit here. Two operators (Perplexity and Meta) explicitly document that these tokens can ignore robots.txt when a user names the URL. Robots.txt blocks on this category are advisory at best.
The default policy most sites should adopt
Allow every retrieval and user-triggered crawler. Decide training case by case based on whether your content is your moat. If it is (original research, paid courseware, premium reporting), block training tokens (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Bytespider, CCBot, Meta-ExternalAgent). If you publish a marketing site, allow training too; the upside of being in the model is real and the downside is small.
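Written out as robots.txt, the moat-protecting variant of that default looks roughly like the fragment below. It is a sketch, not a drop-in file: retrieval and user-triggered tokens need no rules at all, because the absence of a rule means allowed, and the block list should be re-checked against the table above before you ship it.

```txt
# Training crawlers: blocked (content-is-the-moat policy)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# No rules for OAI-SearchBot, Claude-SearchBot, PerplexityBot, Bingbot,
# Googlebot, or Applebot: they stay allowed by default.
```

Note that Googlebot and Applebot appear nowhere in the block list, so Search and Spotlight visibility are untouched.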
Always block crawlers without transparency commitments (Bytespider, Timpibot, cohere-ai). The cost is zero. If a crawler ignores robots.txt in the wild, escalate to WAF rules. User-agent strings are trivial to spoof, so verify requests by reverse DNS or against the IP ranges operators publish. Perplexity, Amazon, OpenAI, and DuckDuckGo all publish JSON IP lists at documented endpoints.
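The reverse-DNS check can be sketched in a few lines of stdlib Python. The two-step forward-confirmed procedure below is the standard one; the `googlebot.com` / `google.com` suffixes are the ones Google documents for Googlebot, and any other operator's suffixes should be copied from its own docs page in the table, not guessed.

```python
import socket

# Documented reverse-DNS suffixes for Googlebot; extend this per operator
# using the docs URLs in the table above rather than assumptions.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")

def suffix_ok(hostname: str, suffixes: tuple) -> bool:
    """True if the PTR hostname ends in one of the operator's domains."""
    return hostname.rstrip(".").lower().endswith(suffixes)

def verify_crawler_ip(ip: str, suffixes: tuple) -> bool:
    """Forward-confirmed reverse DNS:
    1. PTR lookup on the claimed IP.
    2. The hostname must end in a domain the operator publishes.
    3. A forward lookup on that hostname must return the original IP,
       since PTR records alone are set by whoever controls the IP block.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not suffix_ok(hostname, suffixes):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

Run `verify_crawler_ip` on any IP claiming a crawler token before trusting the user-agent string; a spoofer can fake the header but not the operator's DNS.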
One trap to avoid: do not collapse Google-Extended and Googlebot into a single rule. They are independent. Same for Applebot and Applebot-Extended. Blocking the parent crawler kills your Search visibility; blocking the -Extended token only opts you out of generative use.
Verify before shipping
Robots.txt is text in a file until something reads it. isitready.dev parses your live robots.txt, cross-references every token in this table, and flags the contradictions: a Disallow: / that catches OAI-SearchBot you meant to allow, an Applebot block written when the author meant Applebot-Extended, a missing rule for Perplexity-User that lets the WAF answer instead. Run the audit before the next deploy.
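A minimal offline version of that audit fits in stdlib Python using `urllib.robotparser`. The tokens and the intended policy below are illustrative, and note one caveat: `robotparser`'s matching is a simplified model of what each crawler actually implements, so treat this as a smoke test, not proof.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks training crawlers but keeps retrieval open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Bytespider
Disallow: /
"""

def audit(robots_txt: str, expectations: dict,
          url: str = "https://example.com/post") -> list:
    """Return the tokens whose live policy contradicts the intended one.

    expectations maps user-agent token -> True if it should be allowed.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [token for token, want_allowed in expectations.items()
            if rp.can_fetch(token, url) != want_allowed]

# Intended policy: retrieval allowed, training blocked.
contradictions = audit(ROBOTS_TXT, {
    "GPTBot": False,
    "OAI-SearchBot": True,
    "Bytespider": False,
    "PerplexityBot": True,   # no rule present, so default-allow matches intent
})
print(contradictions)  # → [] (no contradictions)
```

Swap in your live robots.txt and the full token list from the table, and any token the function returns is a rule you wrote backwards.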