Cloudflare Workers run JavaScript at the edge, which means every response — HTML, robots.txt, sitemap.xml, llms.txt, headers — flows through your code before reaching crawlers or users. That's a lot of power, but it creates alignment problems when different surfaces are configured in different places and end up saying inconsistent things.
HTML metadata must be server-rendered
<title>, <meta name="description">, and <link rel="canonical"> must appear in the initial HTML response returned by the Worker, not injected by client-side JavaScript. Google's crawler executes JavaScript inconsistently and deprioritizes metadata it has to wait for. AI crawlers (GPTBot, ClaudeBot) often don't execute JavaScript at all. If your framework renders metadata client-side, move that logic into the Worker's SSR pass.
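In practice that means the head is assembled inside the Worker before the response is built. A minimal sketch — `renderHead` is an illustrative helper, not part of any Cloudflare or framework API:

```javascript
// Assemble metadata server-side so crawlers see it in the initial HTML
// response rather than waiting on client-side JavaScript.
function renderHead({ title, description, canonical }) {
  return [
    `<title>${title}</title>`,
    `<meta name="description" content="${description}">`,
    `<link rel="canonical" href="${canonical}">`,
  ].join('\n');
}
```

The same object that feeds `renderHead` can feed your sitemap generator, which keeps the canonical URL defined in exactly one place.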
Canonical URLs: the most common alignment failure
The canonical URL for a page must match in three places: the <link rel="canonical"> tag in the HTML, the <loc> entry in sitemap.xml, and any reference to the page in llms.txt. A single mismatch causes crawlers to treat the page as having duplicate or ambiguous authority. On Workers deployments, this most often happens because the sitemap is generated against a staging origin or .workers.dev subdomain and never updated for production.
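A build-time check can catch this drift before it ships. A sketch, assuming regex extraction is good enough for your markup — the helper names here are illustrative, not from any library:

```javascript
// Pull the canonical URL out of a page's HTML.
function extractCanonical(html) {
  const m = html.match(/<link\s+rel="canonical"\s+href="([^"]+)"/i);
  return m ? m[1] : null;
}

// Collect every <loc> entry from sitemap.xml.
function extractSitemapLocs(xml) {
  return [...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map((m) => m[1]);
}

// Return canonical URLs that appear in page HTML but not in the sitemap.
function findCanonicalMismatches(pages, sitemapXml) {
  const locs = new Set(extractSitemapLocs(sitemapXml));
  return pages
    .map(extractCanonical)
    .filter((url) => url !== null && !locs.has(url));
}
```

Failing the build on a non-empty result is usually enough to keep the three surfaces aligned.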
robots.txt and sitemap.xml as Worker routes
Serve robots.txt from your Workers router at exactly /robots.txt with Content-Type: text/plain. Don't offload it to a CDN origin with different cache rules than your Worker — crawlers that hit a stale or incorrect robots.txt during a deploy window will make decisions based on the old policy.
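In a Worker router, that can be as simple as handling the path before anything else. A sketch — the sitemap URL is a placeholder for your production origin:

```javascript
// Serve robots.txt directly from the Worker so it deploys atomically
// with the rest of the routing logic.
const ROBOTS_TXT = [
  'User-agent: *',
  'Allow: /',
  'Sitemap: https://example.com/sitemap.xml',
  '',
].join('\n');

function handleRobots() {
  return new Response(ROBOTS_TXT, {
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
      // Short cache so a policy change propagates quickly after a deploy.
      'Cache-Control': 'public, max-age=300',
    },
  });
}
```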
Sitemap.xml can be generated dynamically in a Worker, which is useful for keeping URLs fresh without a build step. When you do this, make sure the response returns Content-Type: application/xml and lists only canonical production URLs. Preview .workers.dev subdomains should never appear in a sitemap.
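A sketch of a dynamic generator, assuming the list of paths comes from wherever your routes live — a KV namespace, a build manifest, or a hardcoded array as here. The origin constant is a hypothetical production host:

```javascript
// Build the sitemap against the canonical production origin,
// never the request's own hostname (which may be a preview host).
const CANONICAL_ORIGIN = 'https://example.com';

function buildSitemap(paths) {
  const urls = paths
    .map((p) => `  <url><loc>${CANONICAL_ORIGIN}${p}</loc></url>`)
    .join('\n');
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    urls,
    '</urlset>',
  ].join('\n');
}

function handleSitemap(paths) {
  return new Response(buildSitemap(paths), {
    headers: { 'Content-Type': 'application/xml' },
  });
}
```

Hardcoding the origin, rather than deriving it from the incoming request, is what keeps .workers.dev hosts out of the sitemap.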
Cache-Control and crawlers
Cloudflare's edge cache will serve stale pages to crawlers if you don't set Cache-Control deliberately. For indexable pages, use:
Cache-Control: public, max-age=3600, stale-while-revalidate=86400

For pages you want excluded from indexing, pair a short max-age with an explicit noindex signal:
Cache-Control: no-store
X-Robots-Tag: noindex

Setting X-Robots-Tag in the response header is equivalent to a <meta name="robots" content="noindex"> tag — you don't need both, but the header works even when the crawler doesn't parse the HTML.
Block preview URLs from crawlers
Every Workers deploy gets a .workers.dev subdomain. Crawlers that index preview URLs create duplicate content problems that are annoying to clean up. Block them at two levels: add a Disallow: / rule for all user-agents in the robots.txt served from the .workers.dev origin, and return an X-Robots-Tag: noindex, nofollow header on every response from that origin.
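Both levels can branch on the request hostname inside one Worker. A sketch:

```javascript
// Preview origins get a blanket noindex policy; production is untouched.
function isPreviewHost(hostname) {
  return hostname.endsWith('.workers.dev');
}

// Clone the response with an X-Robots-Tag header added on preview hosts.
function withCrawlerPolicy(response, hostname) {
  if (!isPreviewHost(hostname)) return response;
  const headers = new Headers(response.headers);
  headers.set('X-Robots-Tag', 'noindex, nofollow');
  return new Response(response.body, { status: response.status, headers });
}

// robots.txt body also branches: preview origins disallow everything.
function robotsBody(hostname) {
  return isPreviewHost(hostname)
    ? 'User-agent: *\nDisallow: /\n'
    : 'User-agent: *\nAllow: /\n';
}
```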
Security headers belong in Worker code, not static asset config
Workers static assets support a _headers file, but those rules only apply to static assets served from the Assets binding. They do not apply to responses generated by your Worker handler — API routes, SSR pages, or any route your Worker code handles explicitly. Set security headers in the Worker response itself:
return new Response(html, {
  headers: {
    'Content-Type': 'text/html; charset=utf-8',
    'Cache-Control': 'public, max-age=3600, stale-while-revalidate=86400',
    'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload',
    'X-Content-Type-Options': 'nosniff',
    'Referrer-Policy': 'strict-origin-when-cross-origin',
  },
});

This ensures every crawlable response carries the correct policy headers, regardless of whether it comes from the assets layer or the Worker handler.
llms.txt on Workers
Serve /llms.txt either as a static asset in your Assets binding or as an explicit Worker route. Either way, the response must return Content-Type: text/plain; charset=utf-8, require no authentication, and reference only canonical production URLs — not preview or staging origins. If your llms.txt is generated at build time, add a CI check that validates every URL it contains returns HTTP 200 on the production origin before deploying.
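The CI check can be a short script that extracts the URLs from llms.txt and fetches each one. A sketch — the function names are illustrative, and `fetchFn` is injectable so the check can be tested without network access:

```javascript
// Extract absolute URLs from the llms.txt body.
function extractUrls(llmsTxt) {
  return [...llmsTxt.matchAll(/https?:\/\/[^\s)]+/g)].map((m) => m[0]);
}

// HEAD-request each URL; collect anything that doesn't return 200.
async function validateUrls(urls, fetchFn = fetch) {
  const failures = [];
  for (const url of urls) {
    const res = await fetchFn(url, { method: 'HEAD' });
    if (res.status !== 200) failures.push({ url, status: res.status });
  }
  return failures;
}
```

Fail the deploy if `validateUrls` returns a non-empty list; a 404 in llms.txt is exactly the kind of rot that never shows up in manual review.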
Verify before shipping
The isitready.dev scanner checks all of these surfaces against your live origin — HTML metadata, canonical alignment, robots.txt content type, sitemap URL validity, cache headers, and llms.txt format — and flags alignment failures. Run it after every Workers deploy, not just at launch.