robots.txt is unforgiving in a sneaky way. The file parses on a best-effort basis. RFC 9309, published September 2022, tells parsers to skip lines they don't understand and keep going. So a typo doesn't crash anything. The crawler just drops your rule and falls back to the default — which, for an absent rule, is "allowed." Half your Disallow directives can be silently dead and your logs look fine until traffic patterns shift.

This piece is the inventory. Each section: the bug, what real crawlers do with it, and the fix.

Path case sensitivity is real

Section 2.2.2 of RFC 9309 says path matching SHOULD be case sensitive. Google, Bing, and Yandex all implement that recommendation. So this rule does nothing if your admin lives at /admin:

User-agent: *
Disallow: /Admin

Googlebot will fetch /admin/users and index it. The directive technically parsed. It just never matched a single URL. The fix is to write the path in the exact case the server emits, including any redirect target:

User-agent: *
Disallow: /admin
Disallow: /Admin

Two lines is cheap insurance when the server is case-insensitive but the canonical case isn't pinned. Verify by curling the actual URL the CMS produces — don't trust the editor's title field.
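
One way to pin the canonical case is to ask the server directly. A sketch; example.com and the response shown are hypothetical:

$ curl -sI https://example.com/Admin
HTTP/1.1 301 Moved Permanently
Location: https://example.com/admin/

Whatever casing the Location header (or the CMS-generated link) uses is the casing the Disallow line has to match.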

User-agent tokens are not case sensitive

The opposite rule applies to product tokens. RFC 9309 Section 2.2.1 requires case-insensitive matching against the user-agent string. So all of these reach Googlebot:

User-agent: googlebot
User-agent: Googlebot
User-agent: GOOGLEBOT

The mistake here is the inverse one: engineers assume User-agent: googlebot is broken and add a second Googlebot block "to be safe." Now you have two groups targeting the same crawler. RFC 9309 says a compliant parser combines the rules of every group that matches the same token, so the duplication buys you nothing; a parser that doesn't merge may obey only one block, and the other block's Allow lines vanish. Pick one casing per token and delete duplicates.
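
If the duplicated blocks each carried part of the picture, say a Disallow in one and an Allow in the other, fold them into a single group by hand. A sketch; the /staging/ paths are placeholders:

User-agent: Googlebot
Disallow: /staging/
Allow: /staging/assets/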

The wrong wildcard kills the whole site

Three directives that look similar do completely different things:

Disallow: /
Disallow: /*
Disallow: *

Disallow: / blocks everything under the root. Disallow: /* is the same in Google's parser, since * is documented as matching any sequence of characters and the leading / covers every path. Disallow: * has no leading slash, fails the "path MUST start with /" rule from RFC 9309 Section 2.2.2, and is treated as malformed by strict parsers. Some lenient parsers ignore the line entirely. The site is now fully crawlable when you thought it was locked down.

The wildcards * and $ began as Google extensions and are now listed among RFC 9309's special characters: * matches any run of characters, $ anchors the end of the match. Parsers that predate the RFC may still not support them, so use them deliberately:

User-agent: *
Disallow: /*.pdf$
Disallow: /search?
Allow: /

That blocks PDF files and any search query URL while leaving everything else open. Bing supports the same extensions. Crawlers without wildcard support skip the line, and the failure mode is a little extra crawling rather than an accidental site-wide block, which is the safe direction.

A UTF-8 BOM eats your first directive

Notepad on Windows, some IDEs on save-as, and a handful of CI templating tools prepend a UTF-8 BOM (0xEF 0xBB 0xBF) to text files. Google's parser tolerates a leading BOM. Plenty of others don't. The first line becomes garbage, usually your User-agent: *, and the parser starts scanning at line two looking for a group header. Every rule between that ruined header and the next User-agent line belongs to no group and gets discarded; in a one-group file, that is the entire file.

Detect it from the command line:

$ file robots.txt
robots.txt: UTF-8 Unicode (with BOM) text

Or hexdump -C robots.txt | head -1 and look for ef bb bf at offset zero. Save the file as UTF-8 without BOM. In VS Code: bottom-right encoding picker, "Save with Encoding," "UTF-8" (not "UTF-8 with BOM").
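
To strip the BOM from the command line, one option (assuming GNU sed, which understands \xHH escapes) is to delete the three bytes from the start of the first line:

$ sed -i '1s/^\xEF\xBB\xBF//' robots.txt
$ file robots.txt
robots.txt: ASCII text

The second command just confirms the BOM annotation is gone; the exact output from file depends on what else is in the file.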

Allow vs Disallow: longest match wins, then Allow

When rules collide, RFC 9309 Section 2.2.2 says the most specific match, meaning the rule with the most octets in the path, wins. The RFC also supplies the tiebreaker, and Google implements it: if an Allow and a Disallow match with equal length, Allow wins. So this works as intended:

User-agent: *
Disallow: /reports/
Allow: /reports/public/

/reports/public/quarterly.pdf matches both rules. The Allow path is longer (16 octets vs 9), so the file is crawlable. Reverse the lengths and the URL is blocked. Engineers get this wrong by assuming order matters — it doesn't, in any spec-compliant parser. Move rules around freely; specificity decides.

The second trap: which group applies is decided separately, and it is all-or-nothing. A crawler picks one group based on user-agent specificity and ignores the rest. Putting an Allow for Googlebot inside the User-agent: * block does nothing once a User-agent: Googlebot group exists.
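
If Googlebot needs the exception, the Allow has to live inside Googlebot's own group. A sketch; /beta/ is a placeholder path:

User-agent: Googlebot
Disallow: /beta/
Allow: /beta/launch/

User-agent: *
Disallow: /beta/

Googlebot reads only the first group here; every other crawler falls through to the * group.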

Crawl-delay is not portable

Crawl-delay is not in RFC 9309. Google's parser ignores it entirely — the line logs as an unrecognized directive and is dropped. Bing honors it. Yandex honors it. Anthropic documents support for it on ClaudeBot. So this file slows three of those crawlers and does nothing to Googlebot:

User-agent: *
Crawl-delay: 10

If you actually need to throttle Googlebot, use the crawl rate setting in Search Console or return HTTP 503 with a Retry-After header on the requests you want deferred. Don't write Crawl-delay and assume the problem is handled.
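
If you keep Crawl-delay at all, scope it to the crawlers that read it instead of parking it in the * group, so the file states its own intent. A sketch, with bingbot standing in for any crawler that honors the directive:

User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow:

Remember that bingbot now reads only its own group, so repeat any Disallow rules it still needs there.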

Sitemap and Content-Type details that matter

Sitemap: directives are file-scoped, not group-scoped. They go at the top level, not inside a User-agent block. This is broken:

User-agent: *
Disallow: /admin
Sitemap: https://example.com/sitemap.xml

It often still works — Google's parser is forgiving and pulls Sitemap lines from anywhere. Strict parsers don't. Move the directive to its own line with a blank line before it.
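
The portable layout keeps the Sitemap line outside every group:

User-agent: *
Disallow: /admin

Sitemap: https://example.com/sitemap.xml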

Two final server-side traps. Serve the file as Content-Type: text/plain; a text/html response is treated by some crawlers as not-a-robots-file and ignored. And watch your status codes: per Google's robots.txt spec, 4xx responses are interpreted as "no restrictions exist," while a 5xx response is treated as the site being temporarily unwilling to be crawled. If 5xx persists past 30 days, Google falls back to the last cached copy or, failing that, assumes full access. One more ceiling: the 500 KiB file size cap is hard, bytes past it are discarded, and a bloated robots.txt with rules at the bottom is a config bug waiting to surface.
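
Both server-side checks fit in one request. A sketch; example.com is a placeholder and the response shown is trimmed and hypothetical:

$ curl -sI https://example.com/robots.txt
HTTP/1.1 200 OK
Content-Type: text/plain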

Verify before shipping

robots.txt failures are invisible until indexing patterns shift, and by then the bad cache has propagated. Run your file through an audit before each deploy. isitready.dev parses your live robots.txt the way Google and the major AI crawlers do, flags every silently-dropped directive, and shows the exact byte that broke the rule.