A site owner sees GPTBot/1.1 in their access logs, panics, and pastes User-agent: GPTBot / Disallow: / into robots.txt. Six months later their content stops surfacing in ChatGPT answers and they have no idea why. The block worked. It also wasn't the right block.
OpenAI runs three separate crawlers with three separate jobs. GPTBot handles training data for the foundation models behind ChatGPT. OAI-SearchBot builds the index that powers ChatGPT's search citations. ChatGPT-User fetches a single URL when a person types or pastes it into a conversation. Conflating them is the most common robots.txt mistake of the AI era, and OpenAI has documented the split since GPTBot launched in August 2023.
What GPTBot is and is not
GPTBot crawls the public web to gather text that may be used for training OpenAI's generative foundation models. The user-agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot. OpenAI publishes the live IP CIDR ranges as JSON at https://openai.com/gptbot.json, with parallel files at searchbot.json and chatgpt-user.json for the other two agents. Matching a request's source IP against those published ranges is the only reliable way to confirm it really came from OpenAI; the user-agent header is trivial to spoof.
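If you want to run that check yourself, a short script is enough. The sketch below assumes the range file exposes a prefixes list whose entries carry ipv4Prefix or ipv6Prefix keys; that layout is an assumption, so confirm the field names against the live JSON before relying on it, and swap the placeholder address for one from your own logs.

# Sketch: verify that an IP claiming to be GPTBot falls inside OpenAI's
# published ranges. Assumes a "prefixes" list with "ipv4Prefix"/"ipv6Prefix"
# keys; check the live file and adjust if the schema differs.
import ipaddress
import json
import urllib.request

RANGE_URL = "https://openai.com/gptbot.json"

def load_networks(url=RANGE_URL):
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            # strict=False tolerates ranges published with host bits set
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks

def is_openai_ip(addr, networks):
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in networks)

if __name__ == "__main__":
    nets = load_networks()
    suspect = "203.0.113.7"  # placeholder: an address from your access logs
    verdict = "inside" if is_openai_ip(suspect, nets) else "outside"
    print(f"{suspect} is {verdict} the published GPTBot ranges")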
GPTBot is not the crawler that powers ChatGPT Search. That job belongs to OAI-SearchBot, documented at developers.openai.com/api/docs/bots. The two read robots.txt independently. Disallowing GPTBot does nothing to OAI-SearchBot, and vice versa. OpenAI states this explicitly: each token is independent.
GPTBot honors Disallow. Its support for Crawl-delay is undocumented and inconsistent in the wild; multiple hosting providers reported it being ignored during crawl incidents in 2025. If you need real throttling, do it at the edge.
The training vs. retrieval tradeoff
Blocking GPTBot tells OpenAI: don't add my pages to the next training corpus. That's a defensible position for paywalled journalism, proprietary research, original photography, and anything you license commercially. The New York Times, Reuters, and the BBC all block GPTBot. They have specific revenue reasons.
Blocking GPTBot does not, by itself, hide you from ChatGPT. When a user asks ChatGPT a question that triggers web search, the model retrieves results through OAI-SearchBot's index. Those citations link back to your live page. The training-versus-retrieval split means a site can be invisible to model training and still appear as a cited source — provided OAI-SearchBot is allowed.
The reverse failure is more common and more painful. A User-agent: * / Disallow: / rule meant to block scrapers also blocks OAI-SearchBot, which means no ChatGPT citations, ever. Same outcome for an overly aggressive Cloudflare bot-management rule. If your goal is "don't train on me" rather than "don't read me," write the rule that way.
The three policies, with exact syntax
Pick one. Each block below replaces any existing User-agent: GPTBot stanza.
Allow GPTBot fully. The default for marketing sites and docs that want maximum AI surface area:
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /api/

Block GPTBot but keep ChatGPT search citations. The right setting for publishers, original-research shops, and licensed-content businesses that still want to show up in answers:
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Allow: /

Rate-limit at the edge, allow in robots.txt. Crawl-delay is the wrong tool here because GPTBot ignores it. Allow the crawler, then enforce request-per-second caps in your CDN or WAF. Cloudflare's "AI bot" rule category, Fastly rate-limit policies, and nginx limit_req all key off the user-agent or the published IP ranges:
User-agent: GPTBot
Allow: /
Crawl-delay: 2

The Crawl-delay: 2 line is documentation for humans reading your robots.txt and a signal to compliant crawlers that aren't OpenAI. Real enforcement happens upstream.
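If you key the edge rule off IP ranges rather than the user-agent, the published JSON files are the source of truth. Here is a small helper, sketched under the same prefixes-layout assumption as above, that flattens all three files into one CIDR per line for pasting into a Cloudflare IP list, a Fastly ACL, or an nginx geo block; verify the field names against the live files.

# Sketch: flatten OpenAI's published crawler ranges into one CIDR per line
# for consumption by a CDN IP list, WAF rule, or nginx geo block.
# Assumes each file has a "prefixes" list with ipv4Prefix/ipv6Prefix keys.
import json
import urllib.request

RANGE_FILES = [
    "https://openai.com/gptbot.json",
    "https://openai.com/searchbot.json",
    "https://openai.com/chatgpt-user.json",
]

def cidrs(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            yield cidr

if __name__ == "__main__":
    for url in RANGE_FILES:
        for cidr in cidrs(url):
            print(cidr)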
Why "allow" is the right default
Most sites should allow GPTBot. The argument is short. Training data shapes which brands and product categories the next ChatGPT generation knows about. Being absent from that corpus is a bet that you can win attention later through paid distribution or search. For a SaaS landing page or a developer-tools doc site, that bet is bad. The marginal cost of a GPTBot crawl is bandwidth measured in megabytes per month. The marginal benefit is the chance that "the best tool for X" returns your name when a buyer asks.
The case to block flips when content itself is the product you sell. If your business model is licensing prose to publishers or selling subscriptions to original reporting, GPTBot training use erodes that product. Block it, point your lawyers at OpenAI's licensing team, and keep OAI-SearchBot allowed so you stay in the answer surface.
Verify before shipping
A robots.txt rule that blocks GPTBot but accidentally hits OAI-SearchBot — or one that allows OpenAI but is shadowed by a broader User-agent: * Disallow higher up in the file — is the kind of bug you only catch in a parser. isitready.dev reads your live robots.txt the way OpenAI's crawlers do, flags conflicts between the wildcard rule and per-agent rules, and tells you which of the three OpenAI tokens are reachable, blocked, or rate-limited at the edge. Run it before you ship the change, and again 24 hours after — OpenAI's systems take roughly a day to pick up robots.txt updates.
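If you'd rather sanity-check locally first, Python's standard-library urllib.robotparser will read your live robots.txt and answer per-agent fetch questions, including whether a specific GPTBot group or the wildcard group is what actually applies. Its matching is simpler than a production crawler's longest-match behavior, so treat this as a smoke test; the example.com URL is a placeholder for your own site.

# Sketch: ask the stdlib robots.txt parser what each OpenAI token may fetch.
# robotparser applies a matching per-agent group before falling back to the
# User-agent: * group, so it surfaces wildcard-shadowing mistakes.
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"   # placeholder: your site's origin
CHECK_URL = SITE + "/"         # placeholder: a page you want cited
TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]

parser = RobotFileParser()
parser.set_url(SITE + "/robots.txt")
parser.read()

for token in TOKENS:
    verdict = "allowed" if parser.can_fetch(token, CHECK_URL) else "blocked"
    print(f"{token}: {verdict} for {CHECK_URL}")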