Most teams that block Applebot-Extended worry they just disappeared from Siri. They didn't. Apple introduced the token on June 11, 2024, the day after the WWDC Apple Intelligence keynote, so site owners could refuse training use without giving up product surfaces. The original Applebot still crawls. Spotlight still surfaces results. Siri's web answers still cite you. Only the training pipeline goes dark.
This misconception shapes policy decisions across entire publisher networks. The sections below cover what each token does and the exact robots.txt lines that permit Apple's products while denying Apple's foundation models.
Two tokens, two jobs
Apple operates one web crawler and one control token that sits on top of it. The crawler is Applebot, which has been running since 2015 and identifies itself with a user-agent string ending in (Applebot/0.1; +http://www.apple.com/go/applebot). It powers Spotlight Suggestions, Siri's web answers, the Safari Smart Search Field, and Safari's lookup features across iPhone, iPad, and Mac.
Applebot-Extended is different. Apple's documentation states it does not crawl webpages. It is a secondary user-agent token used to determine whether content already fetched by Applebot is eligible for inclusion in foundation model training. Think of it as a permission flag layered on top of the crawl, not a second crawler. That distinction matters when you read your access logs: you will see Applebot hits, never Applebot-Extended hits, because no separate fetcher exists.
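You can confirm the no-second-fetcher claim against your own traffic. The sketch below tallies user agents from combined-format access log lines; the two sample lines are hypothetical, and in practice you would stream your server's real log file instead:

```python
from collections import Counter

# Two hypothetical combined-log-format lines for illustration; in practice,
# read these from your web server's access log.
sample_log = [
    '1.2.3.4 - - [10/Jun/2025:12:00:00 +0000] "GET /post HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; Applebot/0.1; +http://www.apple.com/go/applebot)"',
    '5.6.7.8 - - [10/Jun/2025:12:00:01 +0000] "GET / HTTP/1.1" 200 1024 '
    '"-" "Mozilla/5.0"',
]

hits = Counter()
for line in sample_log:
    ua = line.rsplit('"', 2)[-2]        # last quoted field: the user agent
    if "Applebot-Extended" in ua:
        hits["Applebot-Extended"] += 1  # should never occur: no such fetcher
    elif "Applebot" in ua:
        hits["Applebot"] += 1

print(hits["Applebot"])            # 1
print(hits["Applebot-Extended"])   # 0
```

If this count of Applebot-Extended hits is ever nonzero, the traffic is not Apple's documented crawler and is worth investigating as a spoofed user agent.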
Both tokens are honored through robots.txt. The model is opt-out for both: silence means Apple may crawl with Applebot and may use the crawled content for training under Applebot-Extended. To exclude training use, you have to write the disallow rule explicitly.
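The opt-out default is easy to demonstrate with Python's standard-library robots.txt parser: feed it an empty file and both tokens come back as permitted.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([])  # an empty robots.txt: no directives at all

# With no rules written, both tokens are permitted -- silence opts you in.
print(rp.can_fetch("Applebot", "https://example.com/"))           # True
print(rp.can_fetch("Applebot-Extended", "https://example.com/"))  # True
```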
What Apple Intelligence trains on your content for
The training corpus governed by Applebot-Extended powers Apple's foundation model stack. Apple ships a roughly 3-billion-parameter on-device language model and a larger mixture-of-experts server model running inside Private Cloud Compute on Apple silicon. Both are covered by the same opt-out token.
Those models drive a specific feature set:
Writing Tools: rewrite, proofread, and summarize across Mail, Messages, Notes, Pages, and any third-party app that adopts the system text view.
Siri: the LLM-backed conversational interface introduced with iOS 18, including onscreen awareness and step-by-step product guidance.
Image Playground and Genmoji: text-to-image generation for inline messages, stickers, and tapbacks.
Visual Intelligence: screen-aware search and action prompts triggered by the screenshot buttons.
Priority Notifications, summaries, and Live Translation: the smaller productivity features layered across the OS.
ChatGPT integration: routed through Apple's models for prompt classification before any external call.
Disallowing Applebot-Extended removes your content from the training pool that improves these models over time. It does not remove you from any of the product surfaces that retrieve content at query time.
The split was the whole point
Apple did not invent this pattern. Google shipped Google-Extended on September 28, 2023, splitting Gemini training permission from Googlebot's search indexing. The framing was identical: site owners wanted a way to stay in Search without consenting to model training. OpenAI followed with GPTBot plus OAI-SearchBot. Anthropic followed with ClaudeBot plus Claude-SearchBot. Apple's June 2024 update was the same playbook applied to its own ecosystem.
Apple's documentation makes the design intent explicit. From the official Apple Support page: "Even if you disallow Applebot-Extended, your website instructions may still allow Applebot to crawl your webpages. In that case, your content would remain discoverable through Spotlight, Siri, as well as other system-wide features on Apple devices."
That is the contract. The training opt-out and the product opt-in are independent levers. Treat them that way.
The split policy
Here is the directive set most publishers should ship. The first block keeps the standard crawler intact so Apple's product surfaces still see your content. The second block denies training use for Apple Intelligence:
# Allow Apple's product crawler (Spotlight, Siri, Safari)
User-agent: Applebot
Allow: /
Disallow: /api/
Disallow: /admin/
# Deny Apple Intelligence training use
User-agent: Applebot-Extended
Disallow: /
A few notes on the syntax. The two User-agent blocks are evaluated independently. Applebot-Extended does not inherit from Applebot. Wildcards under User-agent: * are not enough on their own; Apple looks for the specific token first. If you only specify Googlebot rules and no Applebot rules, Apple's documentation says Applebot will fall back to following Googlebot directives. That fallback does not apply to Applebot-Extended, which requires its own dedicated block.
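Before shipping, the split policy can be sanity-checked with Python's standard-library parser. Two caveats about this checker, not about robots.txt itself: urllib.robotparser matches user-agent tokens by substring and evaluates rules first-match-wins, so in this sketch the Applebot-Extended block is listed first (otherwise the Applebot block would swallow it) and the Disallow lines precede the Allow. Real crawlers match the exact token and the most specific rule, so your production file can keep the layout shown above.

```python
from urllib import robotparser

# Reordered for Python's substring/first-match parser; see caveats above.
rules = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: Applebot
Disallow: /api/
Disallow: /admin/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Product crawler keeps access, minus the excluded paths.
print(rp.can_fetch("Applebot", "https://example.com/post"))           # True
print(rp.can_fetch("Applebot", "https://example.com/api/data"))       # False
# Training token is shut out everywhere.
print(rp.can_fetch("Applebot-Extended", "https://example.com/post"))  # False
```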
If your site sits behind authentication or paywalls, none of this matters in practice. Applebot does not crawl pages that require login credentials. The opt-out only affects content that was already publicly fetchable.
When you might want to allow training
The default recommendation here is: deny Applebot-Extended unless you have a specific reason to feed Apple's training corpus. Three reasons might flip that:
Brand presence in generations. If your content describes a product or vocabulary you want surfaced inside Writing Tools rewrites or Siri answers, training inclusion may help. The effect is diffuse and hard to measure, but it exists.
Reciprocity with platforms. Some publishers have commercial relationships with Apple (News Partner Program members, App Store editorial contributors) where contributing to model quality is part of the broader arrangement.
Open-content mandate. If your site exists to feed AI training, such as public documentation portals or government data archives, blocking is counterproductive.
For everyone else, the cost of allowing training is opaque attribution and zero downstream control. The benefit is speculative. The asymmetry favors disallow.
Verify before shipping
isitready.dev parses your robots.txt, identifies whether Applebot and Applebot-Extended are addressed by name, and flags the common failure mode where a single User-agent: * block leaves training use ambiguous. Run an audit at isitready.dev to confirm the split policy reaches Apple the way you intended.