AI crawlers are now a visible chunk of your traffic logs. Some scrape to train models. Others fetch on behalf of a user who just typed your URL into ChatGPT. Your robots.txt can express preferences for most of them, but the semantics differ by provider — and blanket blocks usually take down retrieval paths you wanted to keep alongside the training crawls you wanted to stop.

The crawlers you need to know

Each major AI company runs at least one crawler, and several run two with different jobs:

  • GPTBot: OpenAI's crawler for content that may be used to train generative AI foundation models.

  • OAI-SearchBot: OpenAI's crawler for surfacing sites in ChatGPT search features.

  • ChatGPT-User: Used for certain user-triggered actions in ChatGPT and custom GPTs. OpenAI says it is not automatic web crawling, and robots.txt may not apply to those user-initiated requests.

  • ClaudeBot: Anthropic's crawler for collecting public web content that may contribute to model training.

  • Claude-User: Anthropic's user-directed fetcher for Claude requests that access websites.

  • Claude-SearchBot: Anthropic's crawler for improving search result relevance and accuracy for users.

  • Google-Extended: Google's product token for controlling whether content crawled by Google may be used for Gemini model training and grounding. It has no separate HTTP user-agent and does not affect Google Search indexing.

  • PerplexityBot: Perplexity's crawler for its search index. Perplexity says blocked pages may still appear by domain, headline, or brief factual summary.

  • Bytespider: ByteDance's crawler, used for AI training. Fewer transparency commitments than the crawlers above.

  • CCBot: Common Crawl's crawler. The dataset it builds is widely used as AI training data by many organizations.

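  • Applebot-Extended: Apple's token for controlling whether content crawled by Applebot may be used to train the models behind Apple Intelligence features. Like Google-Extended, it governs how crawled content is used rather than fetching pages itself.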
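
When triaging logs, it can help to keep the list above as a small lookup table. The following is a minimal sketch in Python; the purpose labels ("training", "search", "user_fetch"), the `classify_user_agent` helper, and the sample User-Agent strings are illustrative assumptions, not provider terminology or real header values.

```python
from typing import Optional, Tuple

# Purpose of each token, summarised from the list above.
# The labels are illustrative shorthand, not terms any provider uses.
AI_CRAWLER_PURPOSE = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "user_fetch",
    "ClaudeBot": "training",
    "Claude-User": "user_fetch",
    "Claude-SearchBot": "search",
    "Google-Extended": "training",      # training and grounding
    "PerplexityBot": "search",
    "Bytespider": "training",
    "CCBot": "training",
    "Applebot-Extended": "training",
}

# Tokens that exist only as robots.txt controls and never appear in request logs.
ROBOTS_ONLY_TOKENS = {"Google-Extended"}


def classify_user_agent(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (token, purpose) for the first known token found in a User-Agent header."""
    lowered = user_agent.lower()
    for token, purpose in AI_CRAWLER_PURPOSE.items():
        if token in ROBOTS_ONLY_TOKENS:
            continue  # nothing to match: this token has no HTTP user-agent
        if token.lower() in lowered:
            return token, purpose
    return None


# Simplified, made-up header values for illustration.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```

User-Agent headers can be spoofed, so a match here is a starting point for investigation rather than proof of who sent the request.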

Training vs. retrieval: a meaningful distinction

The decision to block a crawler depends on what it's doing. Blocking GPTBot signals that your content should not be used to train OpenAI's generative AI foundation models, but ChatGPT search visibility is controlled separately through OAI-SearchBot. Allowing OAI-SearchBot while blocking GPTBot is a reasonable middle position: your content can remain eligible for ChatGPT search without granting the same permission for training use.

Google-Extended is the one teams get wrong most often. Many assume blocking it harms SEO, but Search indexing follows the rules you set for Googlebot, which robots.txt treats as a separate token. You can block Google-Extended and keep full Google Search visibility. The same split applies to Applebot vs. Applebot-Extended.
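
The splits described above can be checked locally before a file ships. The sketch below uses Python's standard-library `urllib.robotparser`; the sample robots.txt, the `is_allowed` helper, and the example.com URL are made up for illustration. Note that this parser matches user-agents by substring and applies the first matching rule in file order, so on more complex files its answers may differ from what a particular crawler does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of training use, leave search crawlers alone.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt: str, token: str, url: str) -> bool:
    """Parse robots_txt and report whether `token` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


URL = "https://example.com/blog/post"
for token in ("GPTBot", "OAI-SearchBot", "Google-Extended", "Googlebot"):
    verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, token, URL) else "blocked"
    print(f"{token:16} {verdict}")
```

In this sample, GPTBot and Google-Extended come back blocked while OAI-SearchBot and Googlebot fall through to the wildcard group and stay allowed.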

robots.txt is advisory, not enforced by the protocol itself. OpenAI, Anthropic, Google, and Perplexity document robots.txt controls for their crawler or product tokens, but site operators should still verify behavior in logs and bot-management tooling. Bytespider comes with fewer published commitments, and the dataset CCBot builds is reused by many organizations you cannot address individually; blocking both is low-cost if you do not want that traffic.
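
One way to act on "verify behavior in logs" is to count requests per crawler token and flag any that hit a path your policy disallows. The sketch below assumes access logs in the common "combined" format; the sample lines, the `DISALLOWED_PREFIXES` policy, and the `audit` helper are hypothetical and would need adapting to your own log layout.

```python
import re
from collections import Counter

# Tokens that can appear in User-Agent headers (Google-Extended has no HTTP user-agent).
LOGGED_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot", "Bytespider", "CCBot",
]

# Hypothetical policy: path prefixes each token has been asked not to fetch.
DISALLOWED_PREFIXES = {"Bytespider": ["/"], "GPTBot": ["/api/", "/admin/"]}

# Combined log format: ... "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def audit(lines):
    """Return (hits per token, list of (token, path) requests that ignored the policy)."""
    hits, violations = Counter(), []
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        ua = match.group("ua").lower()
        token = next((t for t in LOGGED_TOKENS if t.lower() in ua), None)
        if token is None:
            continue
        hits[token] += 1
        path = match.group("path")
        if any(path.startswith(p) for p in DISALLOWED_PREFIXES.get(token, [])):
            violations.append((token, path))
    return hits, violations


# Made-up log lines for illustration.
SAMPLE_LINES = [
    '203.0.113.7 - - [01/Jan/2025:00:00:01 +0000] "GET /blog/post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.8 - - [01/Jan/2025:00:00:02 +0000] "GET /pricing HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '203.0.113.9 - - [01/Jan/2025:00:00:03 +0000] "GET /about HTTP/1.1" 200 700 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"',
]

hits, violations = audit(SAMPLE_LINES)
print(dict(hits))
print(violations)
```

Because User-Agent strings can be spoofed, a report like this is a first pass; confirming the sender is a job for bot-management tooling.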

A practical policy example

The naive approach is Disallow: / for all AI-related crawlers. This may block training crawls, but it can also remove search and answer retrieval paths in ChatGPT, Claude, and Perplexity — the same tools your users and prospects may rely on for research.

A more defensible default: allow retrieval crawlers, block training crawlers selectively where you have a policy reason, and block crawlers that publish few transparency commitments.

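User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/

# A crawler that matches a named group follows only that group (RFC 9309),
# so the protected paths are repeated in each group below.

# Training crawler, allowed here; change to "Disallow: /" to opt out of training use
User-agent: GPTBot
Allow: /
Disallow: /api/
Disallow: /admin/

# Allow search and user-directed fetchers for citation/answer visibility
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
Allow: /
Disallow: /api/
Disallow: /admin/

# Allow Google AI training/grounding use (separate from Search indexing)
User-agent: Google-Extended
Allow: /
Disallow: /api/
Disallow: /admin/

# Block ByteDance/TikTok training
User-agent: Bytespider
Disallow: /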
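
A detail worth checking in any policy shaped like the one above: under RFC 9309, a crawler follows only the most specific group that matches it, so path rules under User-agent: * are not inherited by a crawler that has its own group. The sketch below shows the effect on two cut-down, hypothetical files using Python's `urllib.robotparser`. Disallow lines are listed first because that parser applies the first matching rule in file order, whereas RFC 9309 crawlers use the longest match.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, token: str, path: str) -> bool:
    """Report whether `token` may fetch `path` on a placeholder host."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, "https://example.com" + path)


# The named group only says "Allow: /", so the wildcard's /admin/ rule
# never applies to OAI-SearchBot.
LEAKY = """\
User-agent: *
Disallow: /admin/

User-agent: OAI-SearchBot
Allow: /
"""

# Repeating the protected path inside the named group closes the gap.
TIGHT = """\
User-agent: *
Disallow: /admin/

User-agent: OAI-SearchBot
Disallow: /admin/
Allow: /
"""

print(can_fetch(LEAKY, "OAI-SearchBot", "/admin/users"))  # True
print(can_fetch(TIGHT, "OAI-SearchBot", "/admin/users"))  # False
print(can_fetch(TIGHT, "OAI-SearchBot", "/blog/post"))    # True
```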

What to protect explicitly

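Regardless of your policy toward AI crawlers, certain paths should be blocked for all crawlers: /api/, /admin/, internal tooling routes, and any URL that requires authentication to be useful. Exposing these to crawlers doesn't help your visibility and may surface information you'd rather not have indexed. A crawler that matches its own User-agent group ignores the rules under User-agent: *, so repeat these Disallow lines in every named group. robots.txt is also not access control: these routes still need real authentication.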
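
Since RFC 9309 has each crawler read only the group that matches it, one way to keep the protected paths consistent is to generate the file from a single list. The following is a minimal sketch; `build_robots_txt` and its parameters are hypothetical names, and the token lists are placeholders for whatever your policy decides.

```python
def build_robots_txt(protected_paths, allowed_tokens, blocked_tokens):
    """Render a robots.txt in which every allowed group repeats the protected paths."""
    path_rules = [f"Disallow: {path}" for path in protected_paths]

    # Wildcard group for every crawler that has no group of its own.
    groups = [["User-agent: *", *path_rules, "Allow: /"]]

    # One shared group for crawlers that may read public pages only.
    if allowed_tokens:
        agents = [f"User-agent: {token}" for token in allowed_tokens]
        groups.append([*agents, *path_rules, "Allow: /"])

    # One shared group for crawlers that are asked to stay out entirely.
    if blocked_tokens:
        agents = [f"User-agent: {token}" for token in blocked_tokens]
        groups.append([*agents, "Disallow: /"])

    return "\n\n".join("\n".join(group) for group in groups) + "\n"


print(build_robots_txt(
    protected_paths=["/api/", "/admin/"],
    allowed_tokens=["OAI-SearchBot", "PerplexityBot"],
    blocked_tokens=["Bytespider"],
))
```

Generating the file this way means a new protected path is added once and lands in every group, rather than relying on someone remembering to copy it by hand.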

Verify before shipping

isitready.dev evaluates your robots.txt against known AI crawler and product tokens and flags policy gaps or accidental blocks: cases where a rule intended to block training also blocks retrieval, or where a crawler is unlisted and falls back to whatever User-agent: * allows.