Block GPTBot and you opt out of one model. Block CCBot and you opt out of the dataset that fed dozens of them. The Common Crawl Foundation, registered as a California 501(c)(3) and chaired since 2007 by Applied Semantics co-founder Gil Elbaz, has published open snapshots of the public web since 2008, on a roughly monthly cadence today. The archive now exceeds 10 petabytes. Its crawler identifies as CCBot/2.0 (https://commoncrawl.org/faq/), and the resulting WARC, WAT, and WET files feed the training corpora behind GPT-3, LLaMA, T5, PaLM, RedPajama, and a long tail of open-weight models. Most site operators who block GPTBot and ClaudeBot leave CCBot allowed, which is the opposite of where the leverage lives.
Why CCBot sits upstream of everyone else
The GPT-3 paper (Brown et al., 2020) reports that a filtered Common Crawl subset spanning 41 monthly shards from 2016 through 2019 supplied roughly 410 billion BPE tokens and was weighted to 60% of the pre-training mix, out of the 300 billion tokens seen during training. Meta's LLaMA paper (Touvron et al., 2023) preprocessed five Common Crawl dumps from 2017 to 2020 through the CCNet pipeline, then folded in C4 on top. C4 itself, introduced by Raffel et al. in the T5 paper, is a single Common Crawl snapshot reduced to about 10% of its raw size, roughly 750 GB of English text. Google reused C4 for PaLM. The RedPajama 1.2T-token reproduction of LLaMA pulls five CC dumps through CCNet because that is the published recipe it set out to reproduce.
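The reduction is mostly cheap line- and page-level heuristics. Here is a rough sketch of the kind of filters the T5 paper describes for C4, paraphrased rather than reproduced exactly, applied to one page of extracted text:

import re

TERMINAL = ('.', '!', '?', '"')

def c4_style_filter(page_text):
    # Keep only lines that end in terminal punctuation and carry at least
    # five words; drop obvious boilerplate like "enable javascript" notices.
    kept = []
    for line in page_text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL):
            continue
        if len(line.split()) < 5:
            continue
        if "javascript" in line.lower():
            continue
        kept.append(line)
    cleaned = "\n".join(kept)
    # Page-level drops: too few sentences, leftover code, placeholder text.
    if len(re.findall(r"[.!?]", cleaned)) < 3:
        return None
    if "{" in cleaned or "lorem ipsum" in cleaned.lower():
        return None
    return cleaned

print(c4_style_filter("Home | About | Contact\nWe ship WARC files every month. The archive is public. Anyone can filter it."))

The navigation line is dropped, the prose survives; run that over a raw snapshot and most of the bytes disappear.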
Block CCBot today and your site stops appearing in the next monthly snapshot. Every downstream filter, every CCNet pass, every C4 rebuild loses your URLs at the source. Blocking GPTBot, ClaudeBot, Bytespider, and Applebot-Extended individually does not produce the same effect, because most of those operators also ingest Common Crawl, and Common Crawl is the cheaper input. A 2024 Mozilla Foundation review called this "training data for the price of a sandwich": the dataset costs S3 egress to download, while a fresh independent crawl costs tens of millions.
The retroactive limit nobody mentions
A robots.txt rule added today does not erase past archives. Common Crawl's snapshots are immutable once published, and the foundation does not retroactively delete pages from old WARC files based on a future Disallow. Petabytes of historical CC sit on AWS S3 under s3://commoncrawl/, served from a public bucket, already ingested into pipelines that ran years ago. The opt-out is forward-only.
That changes the calculus on timing. A site that adds the block in 2026 still appears in every CC snapshot from 2013 through the last crawl that saw an Allow. For a site launched after the block, the opt-out is total. For a site with ten years of public history, the opt-out caps future inclusion only.
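Checking what the archive already holds takes one query against the public index API at index.commoncrawl.org. A minimal sketch, assuming the documented collinfo.json listing and the CDX JSON output; swap example.com for your own domain:

import json
import urllib.parse
import urllib.request

COLLINFO = "https://index.commoncrawl.org/collinfo.json"

def recent_captures(domain, limit=10):
    # collinfo.json lists every crawl index; the newest appears first.
    with urllib.request.urlopen(COLLINFO, timeout=30) as resp:
        crawls = json.load(resp)
    cdx_api = crawls[0]["cdx-api"]
    params = urllib.parse.urlencode({
        "url": f"{domain}/*",     # every captured URL under the domain
        "output": "json",
        "limit": str(limit),
    })
    # One JSON record per line; a 404 here means this crawl has no captures.
    with urllib.request.urlopen(f"{cdx_api}?{params}", timeout=60) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            rec = json.loads(line)
            print(rec["timestamp"], rec.get("status", "-"), rec["url"])

recent_captures("example.com")

Loop the same query over the older crawl indexes in collinfo.json to see the full forward-only gap you are living with.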
The robots.txt syntax
Common Crawl documents one user-agent token: CCBot. The crawler periodically rechecks robots.txt, supports Crawl-delay (which Google ignores), follows up to four consecutive HTTP redirects, and uses an adaptive back-off when it sees HTTP 429 or 5xx. As of late 2024 it crawls from dedicated IPv4 ranges (18.97.9.168/29, 18.97.14.80/29, 18.97.14.88/30, 98.85.178.216/32) and one IPv6 block (2600:1f28:365:80b0::/60), with reverse DNS under crawl.commoncrawl.org for verification. The full list is published at https://index.commoncrawl.org/ccbot.json.
# Block Common Crawl from future snapshots.
# Past archives at s3://commoncrawl/ are immutable and unaffected.
User-agent: CCBot
Disallow: /
# Optional: cap fetch rate instead of full disallow.
# Common Crawl honors Crawl-delay; Googlebot does not.
# User-agent: CCBot
# Crawl-delay: 10

Pair the CCBot block with the operator-specific tokens covered in the GPTBot, ClaudeBot, and Google-Extended guide only if you want belt-and-suspenders coverage against operators that run their own crawlers in addition to ingesting CC.
What you give up
Common Crawl is not a search engine. Blocking CCBot does not affect Googlebot, Bingbot, or any major web search index. The visibility cost lands on a smaller set of consumers: academic NLP research, the Mozilla Foundation's analyses of Common Crawl, RedPajama and similar open-dataset projects, archival use cases, and a handful of experimental search startups that build on CC rather than running an original crawl. None of those are the surfaces a marketing team optimizes for.
The Common Crawl Foundation says as much in its own CCBot documentation: blocking the crawler removes you from training data and from search indexing built on the dataset. Robots.txt expresses a preference, not technical enforcement. CCBot honors the protocol; bad actors spoofing the user-agent do not, which is why the foundation publishes its IP ranges for verification.
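That verification is easy to script against the ranges quoted above. A minimal sketch; the ranges can change, so refresh them from the ccbot.json URL cited earlier rather than trusting a hard-coded list forever:

import ipaddress

# The ranges quoted in this section; refresh from ccbot.json before relying on them.
CCBOT_RANGES = [ipaddress.ip_network(c) for c in (
    "18.97.9.168/29",
    "18.97.14.80/29",
    "18.97.14.88/30",
    "98.85.178.216/32",
    "2600:1f28:365:80b0::/60",
)]

def is_ccbot_ip(remote_ip):
    # True only when the connecting address sits inside a published range;
    # a spoofed CCBot User-Agent from anywhere else fails this check.
    addr = ipaddress.ip_address(remote_ip)
    return any(addr.version == net.version and addr in net for net in CCBOT_RANGES)

print(is_ccbot_ip("18.97.9.170"))    # True, inside 18.97.9.168/29
print(is_ccbot_ip("203.0.113.7"))    # False, not a Common Crawl address

Reverse DNS under crawl.commoncrawl.org works as an alternative check if you would rather not maintain an IP list at all.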
If your goal is "don't train on me," CCBot is the single highest-leverage line in robots.txt. If your goal is "don't surface me in any AI product at all," you also need the operator-specific tokens, plus a Cloudflare-style enforcement layer for crawlers that ignore the protocol. Cloudflare flipped default-blocking on for new domains in July 2025, covering CCBot alongside GPTBot, ClaudeBot, Bytespider, Meta-ExternalAgent, and Amazonbot in the AI Scrapers and Crawlers toggle.
Verify before shipping
isitready.dev parses your live robots.txt against the CCBot token, the major operator-specific tokens, and the Content Signals categories (search, ai-input, ai-train). The audit flags the common failure mode where a site blocks GPTBot and ClaudeBot but leaves CCBot — the upstream — wide open.
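The same check can run locally before you ship. A minimal sketch using Python's standard-library robots.txt parser against the tokens this guide covers, with example.com standing in for your own domain:

import urllib.robotparser

# Tokens discussed in this guide; add or remove to match your policy.
TOKENS = ["CCBot", "GPTBot", "ClaudeBot", "Google-Extended"]

def audit(domain):
    rp = urllib.robotparser.RobotFileParser(f"https://{domain}/robots.txt")
    rp.read()
    for token in TOKENS:
        allowed = rp.can_fetch(token, f"https://{domain}/")
        print(f"{token:16} {'still ALLOWED' if allowed else 'blocked'}")

audit("example.com")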