AI crawlers — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and others — don't behave like Googlebot. They weigh signal density, trust, and explicit crawl permission over raw link volume. This checklist covers the six layers a site needs to pass before an AI crawler can evaluate it reliably.

robots.txt

The file must exist at /robots.txt, return HTTP 200, and be served with Content-Type: text/plain. An overly broad Disallow: / rule that incidentally blocks AI crawlers is the most common failure — it's not always intentional, but the effect is the same: the crawler stops and moves on.

  • The Sitemap: directive points to the production canonical sitemap URL, not a staging or preview origin.

  • AI crawlers you want citing you (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended) are explicitly allowed, or at minimum not blocked.

  • Login-required paths (/dashboard, /account, /admin) are disallowed; public docs and landing pages are explicitly allowed.

  • No Crawl-delay set above 10 seconds for major crawlers — many crawlers respond to high values by deprioritizing the site, and some ignore the directive entirely.
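The items above can be sketched as a minimal robots.txt and verified with Python's stdlib parser; the hostname, paths, and crawler list here are illustrative placeholders, not a recommended policy:

```python
import urllib.robotparser

# Hypothetical robots.txt implementing the checklist: AI crawlers
# explicitly allowed, login-required paths disallowed for everyone
# else, and the Sitemap directive pointing at the production origin.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /dashboard
Disallow: /account
Disallow: /admin

Sitemap: https://example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its own group (Allow: /); unknown bots fall
# back to the * group and are blocked from private paths.
print(rp.can_fetch("GPTBot", "https://example.com/docs/intro"))   # True
print(rp.can_fetch("SomeBot", "https://example.com/dashboard"))   # False
```

Checking with the same parser a crawler might use catches group-matching surprises, such as a named-crawler group silently overriding the `*` rules.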

sitemap.xml

A missing or broken sitemap forces crawlers to discover pages through link-following, which is slower and incomplete. The sitemap is the authoritative list of what you want indexed.

  • The file exists, returns valid XML, and is referenced from the Sitemap: directive in robots.txt.

  • Only production canonical URLs are listed — no staging. subdomains, no .workers.dev preview URLs, no localhost references.

  • Every listed URL returns HTTP 200 when fetched directly.

  • <lastmod> values are accurate per-page dates, not a single date stamped across all entries (which signals the sitemap is auto-generated and not maintained).

  • The sitemap has been submitted to Google Search Console so Googlebot (and Google-Extended) pick up changes promptly.
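The "single date stamped across all entries" pattern is easy to self-check. A sketch with Python's stdlib XML parser against a hypothetical sitemap (URLs and dates are placeholders):

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap fragment with per-page lastmod dates.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-11-02</lastmod></url>
  <url><loc>https://example.com/docs</loc><lastmod>2024-12-18</lastmod></url>
  <url><loc>https://example.com/blog/launch</loc><lastmod>2025-01-05</lastmod></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)

locs = [u.findtext("sm:loc", namespaces=NS) for u in root.findall("sm:url", NS)]
lastmods = [u.findtext("sm:lastmod", namespaces=NS) for u in root.findall("sm:url", NS)]

# Every entry sharing one lastmod suggests an auto-stamped,
# unmaintained sitemap; varying dates pass the check.
stamped = len(set(lastmods)) == 1 and len(lastmods) > 1
print(locs)
print(stamped)  # False: dates vary per page
```

The same loop is a natural place to fetch each `loc` and confirm it returns HTTP 200, which covers the direct-fetch item above.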

Canonical metadata

Duplicate content — the same text reachable at multiple URLs — confuses crawlers about which version to attribute. Canonical tags eliminate the ambiguity.

  • Every page has exactly one <link rel="canonical"> tag in <head>, pointing to its own production URL.

  • Canonical URLs in <head> match the corresponding URL in sitemap.xml — disagreement between the two is a common audit finding.

  • The same content is not reachable at both http:// and https://, or at www. and non-www. hosts; a 301 redirect should collapse every variant to the single canonical origin.

Structured data

Structured data gives AI models explicit entity information about your site's content — author, date, topic, product details — rather than requiring them to infer it from prose.

  • Organization JSON-LD is present on the homepage with name, url, and logo.

  • WebSite JSON-LD with a SearchAction is on the homepage so agents know a search interface is available.

  • Software or product pages have SoftwareApplication or Product schema with at least name, description, and offers.

  • Blog posts have Article schema with author, datePublished, and dateModified — missing dateModified is the most frequent gap.

  • FAQ sections have FAQPage schema so answer engines can pull question-answer pairs directly.

  • Validate all schemas at validator.schema.org before shipping — JSON-LD syntax errors are invisible in the browser but fatal to parsers.
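A minimal Article block, plus a stdlib check for the fields flagged above (dateModified being the usual gap). The headline, author, and dates are placeholders; this is the payload that would sit inside a `<script type="application/ld+json">` tag:

```python
import json

# Hypothetical Article JSON-LD for a blog post.
ARTICLE_JSONLD = """{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Making a site readable to AI crawlers",
  "author": {"@type": "Person", "name": "Jane Doe"},
  "datePublished": "2024-12-01",
  "dateModified": "2025-01-10"
}"""

# json.loads fails loudly on the syntax errors that are
# invisible in the browser but fatal to parsers.
data = json.loads(ARTICLE_JSONLD)

REQUIRED = ("author", "datePublished", "dateModified")
missing = [k for k in REQUIRED if k not in data]
print(missing)  # [] — all required fields present
```

A check like this in CI is a cheap complement to validator.schema.org: it won't verify schema.org semantics, but it catches broken JSON and dropped required fields before they ship.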

Security headers

AI crawlers use security signals as a proxy for site trustworthiness. A site without HTTPS enforcement or with missing security headers scores lower on quality signals even if the content is good.

  • Strict-Transport-Security: max-age=31536000; includeSubDomains is present, enforcing HTTPS-only connections for at least one year.

  • X-Content-Type-Options: nosniff is present, preventing MIME-type confusion attacks.

  • X-Frame-Options: DENY or SAMEORIGIN is present to prevent clickjacking.

  • Referrer-Policy: strict-origin-when-cross-origin is present so cross-origin requests don't leak URL paths containing tokens or session data.
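All four headers can be checked mechanically. A sketch against a hypothetical response-header dict (in practice the dict would come from an HTTP client's response object):

```python
# Expected values from the checklist; X-Frame-Options
# accepts either of its two valid settings.
REQUIRED = {
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": ("DENY", "SAMEORIGIN"),
    "Referrer-Policy": "strict-origin-when-cross-origin",
}

# Hypothetical headers as returned by the live origin.
response_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Referrer-Policy": "strict-origin-when-cross-origin",
}

failures = []
for name, expected in REQUIRED.items():
    got = response_headers.get(name)
    allowed = expected if isinstance(expected, tuple) else (expected,)
    if got not in allowed:
        failures.append(name)

print(failures)  # [] — all four headers pass
```

A missing header and a header with the wrong value both land in `failures`, which mirrors how an audit reports per-item findings rather than a single pass/fail.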

Agent-readable text files

/llms.txt is the explicit signal that a site has thought about AI agent access. Crawlers and agents that support the spec will read it before crawling anything else.

  • /llms.txt exists at the canonical origin, is served as Content-Type: text/plain, and is valid Markdown.

  • All URLs referenced in /llms.txt return HTTP 200 and match the canonical URLs in sitemap.xml — drift between the two is the primary failure mode.

  • /llms.txt is discoverable: referenced from robots.txt (a comment line is sufficient) and linked from <head> via <link rel="alternate" type="text/plain" href="/llms.txt">.
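A minimal /llms.txt following the spec's Markdown shape (the site name, description, and URLs below are placeholders): an H1 with the site name, a blockquote summary, then H2 sections listing canonical links with one-line descriptions.

```markdown
# Example Co

> Example Co builds developer tools. The links below are the
> canonical entry points for docs and product pages.

## Docs

- [Quickstart](https://example.com/docs/quickstart): install and first run
- [API reference](https://example.com/docs/api): endpoints and authentication

## Product

- [Pricing](https://example.com/pricing): plans and limits
```

Every URL listed should also appear in sitemap.xml; generating both files from the same source list is the simplest way to prevent the drift described above.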

Verify before shipping

The isitready.dev audit runs a scored version of this checklist against your live origin. Each finding includes the raw HTTP evidence and per-item remediation guidance, so you're not guessing what a crawler actually saw. Run it at isitready.dev before treating the site as AI-ready.