AI crawlers are now a significant share of crawl traffic, and they serve two distinct purposes: training language models and retrieving content for answer generation. Your robots.txt policy controls both — but they're different operations, and blanket blocking often harms you more than it helps.

The crawlers you need to know

Each major AI company runs at least one crawler, and several run more than one, each with a different job:

  • GPTBot: OpenAI's training crawler, announced in August 2023. Blocking it prevents OpenAI from training on your content.

  • OAI-SearchBot: OpenAI's retrieval crawler for ChatGPT search. Blocking it prevents ChatGPT from citing your pages when answering user questions.

  • ChatGPT-User: Used when a ChatGPT user enables web browsing in a conversation. Blocking it prevents real-time retrieval during user sessions.

  • ClaudeBot: Anthropic's crawler for training and general knowledge. Distinct from the browsing agent below.

  • Claude-Web: Anthropic's browsing agent, used for live retrieval in Claude conversations.

  • Google-Extended: Google's robots.txt control token for AI training (Gemini and Vertex AI), announced in September 2023. Fully separate from Googlebot; blocking it has no effect on regular Google Search indexing.

  • PerplexityBot: Perplexity's crawler, which powers its AI search summaries.

  • Bytespider: ByteDance's crawler, used for AI training. It has made fewer transparency commitments than the crawlers above.

  • CCBot: Common Crawl's crawler. The dataset it builds is widely used as AI training data by many organizations.

  • Applebot-Extended: Apple's AI training extension of Applebot, used for Apple Intelligence features.

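Since these crawlers announce themselves in their User-Agent headers, a quick way to see which ones already visit your site is to scan access logs for their tokens. A minimal sketch; the helper name and log lines are illustrative, not from any particular server:

```python
from collections import Counter

# User-agent tokens for the crawlers listed above. Note: Google-Extended
# and Applebot-Extended are robots.txt control tokens; the fetching itself
# is done by Googlebot/Applebot, so those two names generally will not
# appear in request logs.
AI_CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "Google-Extended", "PerplexityBot", "Bytespider", "CCBot",
    "Applebot-Extended",
]

def ai_crawler_hits(log_lines):
    """Count hits per AI crawler token (case-insensitive substring match)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
    return counts

# Illustrative log lines in combined log format.
sample_log = [
    '203.0.113.7 - - [10/May/2025:12:00:01 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '198.51.100.9 - - [10/May/2025:12:00:02 +0000] "GET /blog HTTP/1.1" 200 8192 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"',
    '192.0.2.5 - - [10/May/2025:12:00:03 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/124.0"',
]

print(ai_crawler_hits(sample_log))
```

Substring matching is deliberately loose; for production use you would parse the user-agent field properly and verify published IP ranges, since user-agent strings are trivially spoofed.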
Training vs. retrieval: a meaningful distinction

The decision to block a crawler depends on what it's doing. Blocking GPTBot prevents OpenAI from training on your content, but it also means ChatGPT can't retrieve your page when a user asks a relevant question. Allowing OAI-SearchBot while blocking GPTBot is a reasonable middle position: your content shows up in ChatGPT answers without contributing to OpenAI's training corpus.

Google-Extended is worth calling out specifically: many teams assume blocking it harms SEO, but Googlebot (which handles Search indexing) is a completely separate user-agent. You can block Google-Extended and maintain full Google Search visibility. The same split applies to Applebot vs. Applebot-Extended.
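For teams that want the opposite stance, opting out of Gemini and Vertex AI training while keeping Search visibility, the stanza is just:

User-agent: Google-Extended
Disallow: /

Googlebot is unaffected; regular Search crawling and indexing continue as before.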

robots.txt is advisory, not enforced. Reputable companies — OpenAI, Anthropic, Google, Perplexity — respect it consistently. Bytespider and CCBot have weaker track records; blocking them is low-cost and reduces exposure to training pipelines with less transparency.

A practical policy example

The naive approach is Disallow: / for all AI crawlers. This blocks training, but it also kills your answer visibility in ChatGPT, Claude, and Perplexity — the same tools your users and prospects rely on for research.

A more defensible default: allow retrieval crawlers, selectively block training crawlers where you have a policy reason, and always block crawlers with no transparency commitments.

User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/

# Allow AI crawlers for citation/answer visibility
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Allow Google AI (separate from Googlebot)
User-agent: Google-Extended
Allow: /

# Block ByteDance/TikTok training
User-agent: Bytespider
Disallow: /

# Block Common Crawl (its dataset feeds many training pipelines)
User-agent: CCBot
Disallow: /
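Before deploying, you can sanity-check a policy like this with Python's standard-library robots.txt parser. One caveat: urllib.robotparser resolves rules in file order (first match wins), unlike Google's longest-match semantics, so mixed Allow/Disallow groups can evaluate differently across parsers. A minimal sketch against an abridged version of the policy above:

```python
import urllib.robotparser

# Abridged version of the example policy above.
ROBOTS_TXT = """\
User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/

User-agent: GPTBot
Allow: /

User-agent: Bytespider
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A crawler's own group overrides the * defaults; unlisted agents
# fall back to the * group.
for agent in ("GPTBot", "Bytespider", "SomeOtherBot"):
    print(agent, rp.can_fetch(agent, "https://example.com/pricing"))
```

Running checks like these for every user-agent you care about catches the most common mistake: a rule meant for one crawler silently applying to another.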

What to protect explicitly

Regardless of your policy toward AI crawlers, certain paths should be blocked for all crawlers: /api/, /admin/, internal tooling routes, and any URL that requires authentication to be useful. Exposing these to crawlers doesn't help your visibility and may surface information you'd rather not index.

Verify before shipping

isitready.dev evaluates your robots.txt against all known AI crawler user-agents and flags policy gaps or accidental blocks — cases where a rule intended to block training also blocks retrieval, or where a crawler is unlisted and defaults to full access.