AI crawlers — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and others — don't behave like Googlebot. They weigh signal density, trust, and explicit crawl permission over raw link volume. This checklist covers the six layers a site needs to pass before an AI crawler can evaluate it reliably.

robots.txt

The file must exist at /robots.txt, return HTTP 200, and be served with Content-Type: text/plain. An overly broad Disallow: / rule that incidentally blocks AI crawlers is the most common failure — it's not always intentional, but the effect is the same: the crawler stops and moves on.

  • The Sitemap: directive points to the production canonical sitemap URL, not a staging or preview origin.

  • AI crawlers you want citing you (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended) are explicitly allowed, or at minimum not blocked.

  • Login-required paths (/dashboard, /account, /admin) are disallowed; public docs and landing pages are explicitly allowed.

  • No Crawl-delay set above 10 seconds for major crawlers — many crawlers respond to high values by deprioritizing the site, and some ignore the directive entirely.
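The items above can be sketched as a minimal robots.txt and verified with Python's stdlib parser; the hostname, paths, and crawler list here are illustrative placeholders, not a recommended policy:

```python
import urllib.robotparser

# Hypothetical robots.txt implementing the checklist: AI crawlers
# explicitly allowed, login-required paths disallowed for everyone
# else, and the Sitemap directive pointing at the production origin.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /dashboard
Disallow: /account
Disallow: /admin

Sitemap: https://example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its own group (Allow: /); unknown bots fall
# back to the * group and are blocked from private paths.
print(rp.can_fetch("GPTBot", "https://example.com/docs/intro"))   # True
print(rp.can_fetch("SomeBot", "https://example.com/dashboard"))   # False
```

Checking with the same parser a crawler might use catches group-matching surprises, such as a named-crawler group silently overriding the `*` rules.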

sitemap.xml

A missing or broken sitemap forces crawlers to discover pages through link-following, which is slower and incomplete. The sitemap is the authoritative list of what you want indexed.

  • The file exists, returns valid XML, and is referenced from the Sitemap: directive in robots.txt.

  • Only production canonical URLs are listed — no staging. subdomains, no .workers.dev preview URLs, no localhost references.

  • Every listed URL returns HTTP 200 when fetched directly.

  • <lastmod> values are accurate per-page dates, not a single date stamped across all entries (which signals the sitemap is auto-generated and not maintained).

  • The sitemap has been submitted to Google Search Console so Googlebot (and Google-Extended) pick up changes promptly.
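The "single date stamped across all entries" pattern is easy to self-check. A sketch with Python's stdlib XML parser against a hypothetical sitemap (URLs and dates are placeholders):

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap fragment with per-page lastmod dates.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-11-02</lastmod></url>
  <url><loc>https://example.com/docs</loc><lastmod>2024-12-18</lastmod></url>
  <url><loc>https://example.com/blog/launch</loc><lastmod>2025-01-05</lastmod></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)

locs = [u.findtext("sm:loc", namespaces=NS) for u in root.findall("sm:url", NS)]
lastmods = [u.findtext("sm:lastmod", namespaces=NS) for u in root.findall("sm:url", NS)]

# Every entry sharing one lastmod suggests an auto-stamped,
# unmaintained sitemap; varying dates pass the check.
stamped = len(set(lastmods)) == 1 and len(lastmods) > 1
print(locs)
print(stamped)  # False: dates vary per page
```

The same loop is a natural place to fetch each `loc` and confirm it returns HTTP 200, which covers the direct-fetch item above.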

Canonical metadata

Duplicate content — the same text reachable at multiple URLs — confuses crawlers about which version to attribute. Canonical tags eliminate the ambiguity.

  • Every page has exactly one <link rel="canonical"> tag in <head>, pointing to its own production URL.

  • Canonical URLs in <head> match the corresponding URL in sitemap.xml — disagreement between the two is a common audit finding.

  • The same content is not reachable at both http:// and https://, or at www. and non-www. hosts; a 301 redirect should collapse every variant to the single canonical origin.

Structured data

Structured data gives AI models explicit entity information about your site's content — author, date, topic, product details — rather than requiring them to infer it from prose.

  • Organization JSON-LD is present on the homepage with name, url, and logo.

  • WebSite JSON-LD with a SearchAction is on the homepage so agents know a search interface is available.

  • Software or product pages have SoftwareApplication or Product schema with at least name, description, and offers.

  • Blog posts have Article schema with author, datePublished, and dateModified — missing dateModified is the most frequent gap.

  • FAQ sections have FAQPage schema so answer engines can pull question-answer pairs directly.

  • Validate all schemas at validator.schema.org before shipping — JSON-LD syntax errors are invisible in the browser but fatal to parsers.
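A minimal Article block, plus a stdlib check for the fields flagged above (dateModified being the usual gap). The headline, author, and dates are placeholders; this is the payload that would sit inside a `<script type="application/ld+json">` tag:

```python
import json

# Hypothetical Article JSON-LD for a blog post.
ARTICLE_JSONLD = """{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Making a site readable to AI crawlers",
  "author": {"@type": "Person", "name": "Jane Doe"},
  "datePublished": "2024-12-01",
  "dateModified": "2025-01-10"
}"""

# json.loads fails loudly on the syntax errors that are
# invisible in the browser but fatal to parsers.
data = json.loads(ARTICLE_JSONLD)

REQUIRED = ("author", "datePublished", "dateModified")
missing = [k for k in REQUIRED if k not in data]
print(missing)  # [] — all required fields present
```

A check like this in CI is a cheap complement to validator.schema.org: it won't verify schema.org semantics, but it catches broken JSON and dropped required fields before they ship.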

Security headers

AI crawlers use security signals as a proxy for site trustworthiness. A site without HTTPS enforcement or with missing security headers scores lower on quality signals even if the content is good.

  • Strict-Transport-Security: max-age=31536000; includeSubDomains is present, enforcing HTTPS-only connections for at least one year.

  • X-Content-Type-Options: nosniff is present, preventing MIME-type confusion attacks.

  • X-Frame-Options: DENY or SAMEORIGIN is present to prevent clickjacking.

  • Referrer-Policy: strict-origin-when-cross-origin is present so cross-origin requests don't leak URL paths containing tokens or session data.
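All four headers can be checked mechanically. A sketch against a hypothetical response-header dict (in practice the dict would come from an HTTP client's response object):

```python
# Expected values from the checklist; X-Frame-Options
# accepts either of its two valid settings.
REQUIRED = {
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": ("DENY", "SAMEORIGIN"),
    "Referrer-Policy": "strict-origin-when-cross-origin",
}

# Hypothetical headers as returned by the live origin.
response_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Referrer-Policy": "strict-origin-when-cross-origin",
}

failures = []
for name, expected in REQUIRED.items():
    got = response_headers.get(name)
    allowed = expected if isinstance(expected, tuple) else (expected,)
    if got not in allowed:
        failures.append(name)

print(failures)  # [] — all four headers pass
```

A missing header and a header with the wrong value both land in `failures`, which mirrors how an audit reports per-item findings rather than a single pass/fail.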

Agent-readable text files

/llms.txt is the explicit signal that a site has thought about AI agent access. Crawlers and agents that support the spec will read it before crawling anything else.

  • /llms.txt exists at the canonical origin, is served as Content-Type: text/plain, and is valid Markdown.

  • All URLs referenced in /llms.txt return HTTP 200 and match the canonical URLs in sitemap.xml — drift between the two is the primary failure mode.

  • /llms.txt is discoverable: referenced from robots.txt (a comment line is sufficient) and linked from <head> via <link rel="alternate" type="text/plain" href="/llms.txt">.
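A minimal /llms.txt following the spec's Markdown shape (the site name, description, and URLs below are placeholders): an H1 with the site name, a blockquote summary, then H2 sections listing canonical links with one-line descriptions.

```markdown
# Example Co

> Example Co builds developer tools. The links below are the
> canonical entry points for docs and product pages.

## Docs

- [Quickstart](https://example.com/docs/quickstart): install and first run
- [API reference](https://example.com/docs/api): endpoints and authentication

## Product

- [Pricing](https://example.com/pricing): plans and limits
```

Every URL listed should also appear in sitemap.xml; generating both files from the same source list is the simplest way to prevent the drift described above.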

Verify before shipping

The isitready.dev audit runs a scored version of this checklist against your live origin. Each finding includes the raw HTTP evidence and per-item remediation guidance, so you're not guessing what a crawler actually saw. Run it at isitready.dev before treating the site as AI-ready.