Tokens such as GPTBot, ClaudeBot, PerplexityBot, and Google-Extended look interchangeable in robots.txt, but they aren't. Some control training use (GPTBot, for example). Others control answer retrieval in ChatGPT (OAI-SearchBot). Others fire only when a user pastes a URL into Claude (Claude-User). This checklist covers the six layers a site should make explicit before AI systems can evaluate it reliably.
robots.txt
The file must exist at /robots.txt, return HTTP 200, and be served with Content-Type: text/plain. An overly broad Disallow: / rule that incidentally blocks AI crawlers is the most common failure — it's not always intentional, but the effect is the same: the crawler stops and moves on.
- The Sitemap: directive points to the production canonical sitemap URL, not a staging or preview origin.
- AI crawlers or product tokens you want to permit (for example OAI-SearchBot, Claude-SearchBot, PerplexityBot, GPTBot, or Google-Extended) are explicitly allowed, or at minimum not blocked.
- Login-required paths (/dashboard, /account, /admin) are disallowed; public docs and landing pages are explicitly allowed.
- No extreme Crawl-delay values for crawlers that support the non-standard directive. Google ignores Crawl-delay, while Anthropic documents support for it when appropriate.
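The allow and disallow checks above can be sketched with Python's standard-library robots.txt parser. This is a minimal illustration, not a definitive audit: the sample file, the example.com origin, and the audit_robots helper are all assumptions made for the example, and real crawlers may resolve rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; example.com and its paths are assumptions.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /dashboard
Disallow: /account
Disallow: /admin

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

# Tokens named in the checklist above.
AI_TOKENS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "GPTBot", "Google-Extended"]


def audit_robots(robots_text, url, tokens=AI_TOKENS):
    """Return ({token: allowed?}, [sitemap URLs]) for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    access = {token: parser.can_fetch(token, url) for token in tokens}
    # site_maps() returns None when no Sitemap: line is present.
    return access, parser.site_maps() or []


access, sitemaps = audit_robots(SAMPLE_ROBOTS, "https://example.com/docs/intro")
for token, allowed in access.items():
    print(f"{token}: {'allowed' if allowed else 'blocked'}")
print("Sitemaps:", sitemaps)
```

With this sample, GPTBot is reported as blocked for the docs URL and the other four tokens as allowed. Running the same check against a login-required path such as /dashboard should report every token as blocked.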
sitemap.xml
A missing or broken sitemap forces crawlers to discover pages through link-following, which is slower and often incomplete. The sitemap is the authoritative list of what you want indexed.
- The file exists, returns valid XML, and is referenced from the Sitemap: directive in robots.txt.
- Only production canonical URLs are listed: no staging. subdomains, no .workers.dev preview URLs, no localhost references.
- Every listed URL returns HTTP 200 when fetched directly.
- <lastmod> values are accurate per-page dates, not a single date stamped across all entries (which signals the sitemap is auto-generated and not maintained).
- The sitemap has been submitted to Google Search Console so Google can discover changed canonical URLs more reliably.
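Two of the sitemap checks above (production-only hosts and non-uniform lastmod dates) can be sketched offline. The sample XML, the example.com host, and the audit_sitemap helper are hypothetical; the sketch does not fetch the listed URLs, so the HTTP 200 check is out of scope here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap for illustration; all hosts and dates are assumptions.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://staging.example.com/docs</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://demo.workers.dev/pricing</loc><lastmod>2024-05-01</lastmod></url>
</urlset>
"""


def audit_sitemap(xml_text, production_host):
    """Return a list of findings for non-production URLs and uniform lastmod dates."""
    root = ET.fromstring(xml_text)
    findings = []
    lastmods = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        host = urlsplit(loc).hostname or ""
        if host != production_host:
            findings.append(f"non-production URL: {loc}")
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        if lastmod:
            lastmods.append(lastmod.strip())
    # One date repeated across every entry suggests a build timestamp, not per-page dates.
    if len(lastmods) > 1 and len(set(lastmods)) == 1:
        findings.append(f"all {len(lastmods)} <lastmod> values are identical: {lastmods[0]}")
    return findings


for finding in audit_sitemap(SAMPLE_SITEMAP, "example.com"):
    print(finding)
```

The exact-host comparison is a deliberate simplification: a site that serves canonical pages from several production hosts would need an allowlist instead of a single host name.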
Canonical metadata
Duplicate content — the same text reachable at multiple URLs — confuses crawlers about which version to attribute. Canonical tags eliminate the ambiguity.
- Every page has exactly one <link rel="canonical"> tag in <head>, pointing to its own production URL.
- Canonical URLs in <head> match the corresponding URL in sitemap.xml; disagreement between the two is a common audit finding.
- The same content is not served at both http:// and https:// variants, or at www. and non-www. variants, without a redirect collapsing them to a single canonical URL.
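The "exactly one canonical tag, matching the sitemap URL" rule can be sketched with the standard-library HTML parser. The CanonicalCollector class and check_canonical helper are hypothetical names, the sample page is invented, and for brevity the sketch counts canonical links anywhere in the document rather than only inside <head>.

```python
from html.parser import HTMLParser


class CanonicalCollector(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel is a space-separated token list, so split before comparing.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonicals.append(attr.get("href"))


def check_canonical(html_text, expected_url):
    """Return 'ok' or a short description of the canonical problem."""
    collector = CanonicalCollector()
    collector.feed(html_text)
    found = collector.canonicals
    if not found:
        return "missing canonical tag"
    if len(found) > 1:
        return f"{len(found)} canonical tags found"
    if found[0] != expected_url:
        return f"canonical {found[0]!r} does not match {expected_url!r}"
    return "ok"


# Hypothetical page; expected_url stands in for the matching sitemap.xml entry.
page = '<html><head><link rel="canonical" href="https://example.com/docs"></head></html>'
print(check_canonical(page, "https://example.com/docs"))
```

Comparing strings exactly means a trailing-slash difference is reported as a mismatch, which is usually the desired behavior when the goal is agreement between <head> and sitemap.xml.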
Structured data
Structured data gives AI models explicit entity information about your site's content — author, date, topic, product details — rather than requiring them to infer it from prose.
- Organization JSON-LD is present on the homepage with name, url, and logo.
- If the site has a real public search surface, WebSite JSON-LD with a SearchAction is on the homepage so clients can identify that search interface.
- Software or product pages have SoftwareApplication or Product schema with at least name, description, and offers.
- Blog posts have Article schema with author, datePublished, and dateModified; a missing dateModified is the most frequent gap.
- FAQ sections have FAQPage schema so answer engines can pull question-answer pairs directly.
- Validate all schemas at validator.schema.org before shipping. JSON-LD syntax errors are invisible in the browser but fatal to parsers.
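The required-property checks above lend themselves to a small offline test before the schema validator is involved. The sketch below assumes the JSON-LD has already been extracted from the page's script tag; the REQUIRED table mirrors the checklist, and missing_properties and the sample article are hypothetical.

```python
import json

# Properties the checklist above asks for, keyed by schema.org type.
REQUIRED = {
    "Organization": ["name", "url", "logo"],
    "Article": ["author", "datePublished", "dateModified"],
    "SoftwareApplication": ["name", "description", "offers"],
    "Product": ["name", "description", "offers"],
}


def missing_properties(jsonld_text):
    """Return {type: [missing property names]} for each recognized JSON-LD node."""
    data = json.loads(jsonld_text)  # raises ValueError on JSON syntax errors
    nodes = data if isinstance(data, list) else [data]
    report = {}
    for node in nodes:
        if not isinstance(node, dict):
            continue
        node_type = node.get("@type")
        # Simplification: nodes with a list-valued @type or an @graph wrapper are skipped.
        if not isinstance(node_type, str) or node_type not in REQUIRED:
            continue
        report[node_type] = [key for key in REQUIRED[node_type] if key not in node]
    return report


# Hypothetical blog post markup with the dateModified gap described above.
SAMPLE_ARTICLE = """
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example post",
  "author": {"@type": "Person", "name": "Example Author"},
  "datePublished": "2024-05-01"
}
"""

print(missing_properties(SAMPLE_ARTICLE))
```

A check like this only confirms that keys are present. It does not replace validation at validator.schema.org, which also checks value types and nesting.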
Security headers
Security headers don't move rankings on their own. They're public evidence of HTTPS hygiene — a site missing them looks unfinished to users and automated audits even when the content is good.
- Strict-Transport-Security: max-age=31536000; includeSubDomains is present, enforcing HTTPS-only connections for at least one year.
- X-Content-Type-Options: nosniff is present, preventing MIME-type confusion attacks.
- X-Frame-Options: DENY or SAMEORIGIN is present to prevent clickjacking.
- Referrer-Policy: strict-origin-when-cross-origin is present so cross-origin requests don't leak URL paths containing tokens or session data.
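The four header checks above can be expressed as one function over a response's headers. This is a sketch under assumptions: audit_security_headers is a hypothetical helper, the headers are passed in as a plain dict rather than fetched, and the thresholds simply restate the checklist values.

```python
import re

ONE_YEAR_SECONDS = 31536000


def audit_security_headers(headers):
    """Return a list of findings for the four headers in the checklist above."""
    # Header names are case-insensitive, so normalize them first.
    h = {name.lower(): value.strip() for name, value in headers.items()}
    findings = []

    hsts = h.get("strict-transport-security", "")
    match = re.search(r"max-age=(\d+)", hsts, re.IGNORECASE)
    if not match or int(match.group(1)) < ONE_YEAR_SECONDS:
        findings.append("Strict-Transport-Security: missing or max-age below one year")
    elif "includesubdomains" not in hsts.lower():
        findings.append("Strict-Transport-Security: includeSubDomains missing")

    if h.get("x-content-type-options", "").lower() != "nosniff":
        findings.append("X-Content-Type-Options: expected nosniff")

    if h.get("x-frame-options", "").upper() not in {"DENY", "SAMEORIGIN"}:
        findings.append("X-Frame-Options: expected DENY or SAMEORIGIN")

    if h.get("referrer-policy", "").lower() != "strict-origin-when-cross-origin":
        findings.append("Referrer-Policy: expected strict-origin-when-cross-origin")

    return findings


# Hypothetical response headers with two gaps.
sample = {
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Content-Type-Options": "nosniff",
}
for finding in audit_security_headers(sample):
    print(finding)
```

For the sample above, the function reports the missing X-Frame-Options and Referrer-Policy headers and accepts the other two.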
Agent-readable text files
/llms.txt is an explicit signal that a site has thought about AI
agent access. Crawlers and agents that support the proposal can use it
as a compact map before spending tokens or requests elsewhere.
- /llms.txt exists at the canonical origin, is served as Content-Type: text/plain, and is valid Markdown.
- All URLs referenced in /llms.txt return HTTP 200 and match the canonical URLs in sitemap.xml; drift between the two is the primary failure mode.
- /llms.txt is discoverable: referenced from robots.txt (a comment line is sufficient) and linked from <head> via <link rel="alternate" type="text/plain" href="/llms.txt">.
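Drift between /llms.txt and sitemap.xml can be detected by comparing the two URL sets. The sketch below assumes the sitemap URLs have already been collected into a list; the sample llms.txt content, the example.com URLs, and the llms_txt_drift helper are hypothetical, and the HTTP 200 check is left out because it needs live requests.

```python
import re

# Matches Markdown inline links with absolute http(s) targets: [label](https://...)
LINK_PATTERN = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")

# Hypothetical /llms.txt content for illustration.
SAMPLE_LLMS_TXT = """\
# Example Site

> Short summary of what the site offers.

## Docs

- [Getting started](https://example.com/docs/start): setup steps
- [API reference](https://example.com/docs/api): endpoints and types
- [Old guide](https://example.com/docs/legacy): retired walkthrough
"""


def llms_txt_drift(llms_text, sitemap_urls):
    """Return URLs linked from llms.txt that are absent from the sitemap."""
    linked = LINK_PATTERN.findall(llms_text)
    known = set(sitemap_urls)
    return [url for url in linked if url not in known]


sitemap_urls = [
    "https://example.com/docs/start",
    "https://example.com/docs/api",
]
print(llms_txt_drift(SAMPLE_LLMS_TXT, sitemap_urls))
```

Here the only drift reported is the legacy guide, which is linked from the sample llms.txt but missing from the sitemap list. Relative links and reference-style Markdown links are not matched by this pattern and would need separate handling.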
Verify before shipping
The isitready.dev audit runs a scored version of this checklist against your live origin. Each finding includes the raw HTTP evidence and per-item remediation guidance, so you're not guessing what a crawler actually saw. Run it at isitready.dev before treating the site as AI-ready.