AI crawlers — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and others — don't behave like Googlebot. They prioritize signal density, trust signals, and crawl permission over raw link volume. This checklist covers the six layers a site needs to pass before an AI crawler can evaluate it reliably.
## robots.txt
The file must exist at `/robots.txt`, return HTTP 200, and be served with `Content-Type: text/plain`. An overly broad `Disallow: /` rule that incidentally blocks AI crawlers is the most common failure — it's not always intentional, but the effect is the same: the crawler stops and moves on.
- The `Sitemap:` directive points to the production canonical sitemap URL, not a staging or preview origin.
- AI crawlers you want citing you (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended) are explicitly allowed, or at minimum not blocked.
- Login-required paths (`/dashboard`, `/account`, `/admin`) are disallowed; public docs and landing pages are explicitly allowed.
- No `Crawl-delay` is set above 10 seconds for major crawlers — higher values cause most crawlers to deprioritize the site entirely.
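The items above can be combined into a file like the following sketch (example.com is a placeholder origin). Note that under the Robots Exclusion Protocol a crawler obeys only the most specific `User-agent` group that matches it, so if you add a dedicated group for a bot, repeat any `Disallow` rules you still want applied to it:

```text
# robots.txt — placeholder example; adjust paths and origin to your site
User-agent: *
Disallow: /dashboard
Disallow: /account
Disallow: /admin

# llms.txt is available at /llms.txt

Sitemap: https://example.com/sitemap.xml
```

Here the AI crawlers are allowed by default (nothing blocks them), login-required paths are disallowed for everyone, and the `Sitemap:` directive points at the production origin.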
## sitemap.xml
A missing or broken sitemap forces crawlers to discover pages through link-following, which is slower and incomplete. The sitemap is the authoritative list of what you want indexed.
- The file exists, returns valid XML, and is referenced from the `Sitemap:` directive in robots.txt.
- Only production canonical URLs are listed — no `staging.` subdomains, no `.workers.dev` preview URLs, no localhost references.
- Every listed URL returns HTTP 200 when fetched directly.
- `<lastmod>` values are accurate per-page dates, not a single date stamped across all entries (which signals the sitemap is auto-generated and not maintained).
- The sitemap has been submitted to Google Search Console so Googlebot (and Google-Extended) pick up changes promptly.
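A minimal valid sitemap with per-page `<lastmod>` dates might look like this sketch (the URLs and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-06-02</lastmod>
  </url>
  <url>
    <loc>https://example.com/docs/getting-started</loc>
    <lastmod>2024-05-18</lastmod>
  </url>
</urlset>
```

Each `<lastmod>` reflects when that specific page actually changed, not a single build timestamp.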
## Canonical metadata
Duplicate content — the same text reachable at multiple URLs — confuses crawlers about which version to attribute. Canonical tags eliminate the ambiguity.
- Every page has exactly one `<link rel="canonical">` tag in `<head>`, pointing to its own production URL.
- Canonical URLs in `<head>` match the corresponding URL in sitemap.xml — disagreement between the two is a common audit finding.
- The same content is not served at both `http://` and `https://` variants, or at `www.` and non-`www.` variants, without a redirect chain collapsing them to a single canonical.
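For a hypothetical page at `https://example.com/docs/getting-started`, the canonical tag in `<head>` would be:

```html
<!-- In the <head> of https://example.com/docs/getting-started -->
<link rel="canonical" href="https://example.com/docs/getting-started">
```

The `href` value should match the `<loc>` entry for the same page in sitemap.xml exactly, including scheme and host.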
## Structured data
Structured data gives AI models explicit entity information about your site's content — author, date, topic, product details — rather than requiring them to infer it from prose.
- `Organization` JSON-LD is present on the homepage with `name`, `url`, and `logo`.
- `WebSite` JSON-LD with a `SearchAction` is on the homepage so agents know a search interface is available.
- Software or product pages have `SoftwareApplication` or `Product` schema with at least `name`, `description`, and `offers`.
- Blog posts have `Article` schema with `author`, `datePublished`, and `dateModified` — missing `dateModified` is the most frequent gap.
- FAQ sections have `FAQPage` schema so answer engines can pull question-answer pairs directly.
- Validate all schemas at validator.schema.org before shipping — JSON-LD syntax errors are invisible in the browser but fatal to parsers.
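As an illustration of the `Article` item above, a minimal JSON-LD sketch (all values are placeholders) embedded via a `<script type="application/ld+json">` tag:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example post title",
  "author": { "@type": "Person", "name": "Jane Doe" },
  "datePublished": "2024-01-15",
  "dateModified": "2024-06-02"
}
```

Including both `datePublished` and `dateModified` closes the most frequent gap the checklist calls out.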
## Security headers
AI crawlers use security signals as a proxy for site trustworthiness. A site without HTTPS enforcement or with missing security headers scores lower on quality signals even if the content is good.
- `Strict-Transport-Security: max-age=31536000; includeSubDomains` is present, enforcing HTTPS-only connections for at least one year.
- `X-Content-Type-Options: nosniff` is present, preventing MIME-type confusion attacks.
- `X-Frame-Options: DENY` or `SAMEORIGIN` is present to prevent clickjacking.
- `Referrer-Policy: strict-origin-when-cross-origin` is present so cross-origin requests don't leak URL paths containing tokens or session data.
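Together, these four headers appear in a raw HTTP response like this (values copied from the checklist; set them at your server or CDN layer):

```http
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
Referrer-Policy: strict-origin-when-cross-origin
```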
## Agent-readable text files
`/llms.txt` is the explicit signal that a site has thought about AI agent access. Crawlers and agents that support the spec will read it before crawling anything else.
- `/llms.txt` exists at the canonical origin, is served as `Content-Type: text/plain`, and is valid Markdown.
- All URLs referenced in `/llms.txt` return HTTP 200 and match the canonical URLs in sitemap.xml — drift between the two is the primary failure mode.
- `/llms.txt` is discoverable: referenced from robots.txt (a comment line is sufficient) and linked from `<head>` via `<link rel="alternate" type="text/plain" href="/llms.txt">`.
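A minimal `/llms.txt` sketch following the proposal's structure (an H1 title, a blockquote summary, then H2 sections of annotated links; all names and URLs here are placeholders):

```markdown
# Example Site

> Placeholder one-line summary of what the site offers.

## Documentation

- [Getting started](https://example.com/docs/getting-started): install and first run
- [API reference](https://example.com/docs/api): endpoints and authentication
```

Every link should point at the same canonical URLs listed in sitemap.xml.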
## Verify before shipping
The isitready.dev audit runs a scored version of this checklist against your live origin. Each finding includes the raw HTTP evidence and per-item remediation guidance, so you're not guessing what a crawler actually saw. Run it at isitready.dev before treating the site as AI-ready.
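Before running a full audit, the robots.txt layer can be spot-checked locally with Python's standard-library parser. A minimal sketch, assuming a placeholder robots.txt body and the GPTBot user agent (in practice you would fetch your live `/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt body; substitute the contents of your live file.
ROBOTS = """\
User-agent: *
Disallow: /admin
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

# Public docs should be fetchable by an AI crawler...
print(parser.can_fetch("GPTBot", "https://example.com/docs/"))   # True
# ...while login-required paths stay disallowed.
print(parser.can_fetch("GPTBot", "https://example.com/admin"))   # False
```

This only checks crawl permission; the other five layers still need direct HTTP inspection.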