Your Changelog Is an Authority Signal. Linear, Stripe, and Anthropic Show How.

Cloudflare's Block AI Scrapers toggle now sits in front of more than a million websites. The default is hostile to AI search visibility, and the per-bot allowlist most operators actually want takes about 60 minutes to configure correctly. This is the decision framework.

By Tomás Silva, Marketplace & Platform · May 26, 2026 · 18 min read

When Cloudflare quietly moved its Block AI Scrapers and Crawlers toggle into a one-click prompt inside the dashboard onboarding flow in late July 2024, the feature reached more than 1 million websites within the first 60 days, according to Cloudflare's own announcement of the feature. By the September 2025 expansion that added per-bot category controls and a pay-per-crawl experiment, the same toggle had been enabled on an estimated 4 to 6 percent of the entire Cloudflare-fronted internet. That footprint includes a non-trivial share of the long tail of B2B SaaS, professional services, and ecommerce sites whose marketing and AEO (answer engine optimization) teams never saw the dashboard prompt. The operational consequence has been a slow-rolling collapse in AI search visibility for thousands of brands that did not realize their infrastructure team had clicked a button.

This article is the decision framework Signal operators are asking for. It walks through what the Cloudflare block actually does at the request level, the per-bot allow-and-block matrix that protects training data without sacrificing live retrieval visibility, the comparable controls in Akamai Bot Manager, Fastly Next-Gen WAF, and AWS WAF, and a 60-minute reconfiguration playbook that fixes the default. The frame is operator-first, not vendor-neutral: most operators want to opt out of being free training data for model providers while remaining citable in the AI search products their customers use, and the default Cloudflare configuration optimizes for neither of those goals.

What Cloudflare's One-Click Block Actually Does

The dashboard toggle hides three distinct enforcement mechanisms. The first is a managed Web Application Firewall rule that matches a curated user-agent list and a curated IP-range list maintained by Cloudflare's bot intelligence team. The second is a fingerprinting layer that catches crawlers using rotated user agents but consistent TLS, header, and request-cadence signatures characteristic of known AI scrapers. The third is a behavioral layer that flags crawl patterns consistent with bulk scraping — high request rate, low session entropy, no JavaScript execution — and applies a managed challenge or hard block depending on the tenant configuration.

The user-agent list as of the most recent public update is documented at the Cloudflare AI Audit landing page and includes at minimum the following 47 signatures: GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, anthropic-ai, PerplexityBot, Perplexity-User, Google-Extended, CCBot, Bytespider, FacebookBot, Meta-ExternalAgent, Amazonbot, Applebot-Extended, ImagesiftBot, Diffbot, omgili, omgilibot, FriendlyCrawler, YouBot, cohere-ai, Cohere-User, AndiBot, Webzio-Extended, Magpie-crawler, MistralAI, Velen, Kangaroo, PanguBot, NovaAct, ChatGLM, Sogou, Yisou, Inflection-AI, Stability-AI, Stable-Diffusion-Bot, Iaskspider, Timpibot, ICC-Crawler, NovaScout, NeevaBot, ClaudeBot-User, AwarioRssBot, Datenbank, ContextScout, and the catchall AI-Scraper category Cloudflare uses for bots it has identified but has not publicly named. The list is updated without notice; the operator who configured an allowlist six months ago against the version of the list at that time is, in most cases, currently blocking bots that did not exist when the original configuration was made.
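The drift described above can be checked mechanically. The following is a minimal sketch, assuming you kept a snapshot of the signatures you reviewed when the allowlist was first configured; both sets below are hypothetical examples, not Cloudflare's actual list at any date.

```python
def unreviewed_bots(current_list, reviewed_at_config_time):
    """Return signatures on the current block list that nobody has reviewed yet."""
    return sorted(set(current_list) - set(reviewed_at_config_time), key=str.lower)


# Hypothetical snapshot taken when the allowlist was configured.
reviewed = {"GPTBot", "ClaudeBot", "PerplexityBot", "CCBot"}

# Hypothetical current list, with two signatures added since the snapshot.
current = reviewed | {"OAI-SearchBot", "Perplexity-User"}

print(unreviewed_bots(current, reviewed))
# ['OAI-SearchBot', 'Perplexity-User']
```

Any signature this returns is being handled by the default policy rather than by a decision someone made, which is the gap the paragraph above warns about.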

What this means for AEO is straightforward. The block applies before any of your application logic runs. When OAI-SearchBot crawls on behalf of a ChatGPT user who just asked a question about your category, Cloudflare returns 403 Forbidden at the edge: your origin never sees the request, your analytics never record it, and ChatGPT receives a hard refusal from the live retrieval layer. The model then answers from cached training data or from third-party sources that did not block, and your brand becomes invisible in the citation stack within the time horizon described in the FAQ below.

The default behavior on a fresh Cloudflare account in 2026 is to prompt for enablement during onboarding with copy that emphasizes "protect your content from AI training" without mentioning the live retrieval impact. The Verge has documented the marketing framing Cloudflare uses to position the feature, which is honest about training data and ambiguous about retrieval. Most non-technical operators read the prompt as "block scrapers, keep search engines" and click yes, then discover three to nine months later that AI search citations have collapsed.

The Per-Bot Allow-and-Block Decision Matrix

The decision framework that the operator community has converged on through 2025 and into 2026 distinguishes three categories of AI bot: live retrieval bots that fetch a page at the moment a user asks a question, training corpus bots that crawl in bulk for model weight updates, and dual-use bots that do both depending on context. The table below is the working configuration most mid-market AEO programs are running as of May 2026.

Bot Signature	Operator	Function	Recommended Action	Rationale
OAI-SearchBot	OpenAI	Live retrieval for ChatGPT search	Allow	Primary live-retrieval crawler for ChatGPT search product
ChatGPT-User	OpenAI	On-demand fetch when user asks ChatGPT	Allow	Critical for in-conversation citations
GPTBot	OpenAI	Training corpus extension	Block if opting out of training	No retrieval impact; opt-out signal recognized
PerplexityBot	Perplexity	Live retrieval for Perplexity answers	Allow	Primary Perplexity citation source
Perplexity-User	Perplexity	On-demand fetch for user queries	Allow	Required for fresh-question citations
ClaudeBot-User	Anthropic	Live retrieval for Claude product	Allow	In-product citation source
ClaudeBot	Anthropic	Mixed retrieval and training	Allow	Live retrieval value exceeds training cost
anthropic-ai	Anthropic	Legacy training corpus crawler	Block if opting out of training	No retrieval impact
Claude-Web	Anthropic	Legacy claude.ai user fetch	Allow	Older Claude UI fetch path
Google-Extended	Google	Gemini training opt-out signal	Allow or block by preference	Does not affect Google Search ranking
Googlebot	Google	Search indexing including AI Overviews	Allow	Required for Google AI Overviews citations
Applebot	Apple	Apple Intelligence indexing	Allow	Required for Siri and Apple Intelligence
Applebot-Extended	Apple	Apple Intelligence training opt-out	Allow or block by preference	Does not affect Siri ranking
Bingbot	Microsoft	Bing Search and Copilot grounding	Allow	Required for Copilot citations
Meta-ExternalAgent	Meta	Meta AI live retrieval	Allow	Required for Meta AI citations
CCBot	Common Crawl	Bulk training corpus	Block if opting out	No live retrieval; broad training impact
Bytespider	ByteDance	TikTok and Doubao training	Block by default in Western markets	Limited retrieval value outside China
Amazonbot	Amazon	Alexa and Rufus indexing	Allow	Required for Alexa and Rufus citations
FacebookBot	Meta	Link preview and indexing	Allow	Required for Facebook and Instagram previews
Cohere-User	Cohere	Live retrieval	Allow	Enterprise AI citation source
cohere-ai	Cohere	Training	Block if opting out	No retrieval impact
Diffbot	Diffbot	Knowledge graph extraction	Allow if you sell to Diffbot customers	Powers third-party knowledge graphs
MistralAI	Mistral	Mixed retrieval and training	Allow	EU-resident model with growing citation share
AI-Scraper (catchall)	Cloudflare	Unidentified AI bot category	Block	Catchall for unknown crawlers

The matrix is opinionated, and individual operators will adjust based on their specific category and customer base. A privacy-first health technology brand might block CCBot and Google-Extended where a general B2B SaaS would not. A media company with a licensed-content business model might block everything in the training column and most of the retrieval column except OAI-SearchBot under an active commercial license. The framework matters more than the specific cell values, which is what makes the one-click default so corrosive: it forces a single posture across every business that deploys Cloudflare without the dialogue the matrix above forces operators to have.
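One way to keep the framework separate from the cell values is to store each bot's function and derive the action from the operator's preferences. The sketch below is illustrative only: it encodes a subset of the matrix, the `Role` enum and the `decide` function are hypothetical names, and treating every unlisted AI bot as the catchall row is an assumption that would wrongly block ordinary search crawlers if applied to all traffic.

```python
from enum import Enum


class Role(Enum):
    RETRIEVAL = "live retrieval"
    TRAINING = "training corpus"
    DUAL = "dual use"


# Subset of the matrix above; roles follow its Function column.
MATRIX = {
    "OAI-SearchBot": Role.RETRIEVAL,
    "ChatGPT-User": Role.RETRIEVAL,
    "GPTBot": Role.TRAINING,
    "PerplexityBot": Role.RETRIEVAL,
    "Perplexity-User": Role.RETRIEVAL,
    "ClaudeBot": Role.DUAL,
    "anthropic-ai": Role.TRAINING,
    "CCBot": Role.TRAINING,
    "cohere-ai": Role.TRAINING,
    "MistralAI": Role.DUAL,
}


def decide(bot, opt_out_of_training=True, allow_dual_use=True):
    """Return 'allow' or 'block' for an AI bot signature under the given posture."""
    role = MATRIX.get(bot)
    if role is None:
        return "block"  # unlisted AI bot: treated like the catchall row
    if role is Role.RETRIEVAL:
        return "allow"  # live retrieval is always allowed in this framework
    if role is Role.DUAL:
        return "allow" if allow_dual_use else "block"
    return "block" if opt_out_of_training else "allow"


print(decide("GPTBot"))                        # block
print(decide("ClaudeBot"))                     # allow
print(decide("ClaudeBot", allow_dual_use=False))  # block
```

A licensed-content publisher and a general B2B SaaS would call `decide` with different flags against the same table, which is the point of the paragraph above: the posture is a parameter, not a property of the bot.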

For deeper context on robots.txt-style directive files that complement the WAF-layer enforcement, see our companion piece on LLMs.txt as the new robots.txt for AI crawler control. The combination of a per-bot WAF policy plus a clear LLMs.txt declaration produces the cleanest legal and operational posture, because the WAF enforces what the operator can technically control and the LLMs.txt declares intent for bots that respect text-based opt-out signals.

How the Three Big Competitors Compare

Cloudflare is not the only edge provider with AI bot controls, but it is the only one that ships the controls in a default-prompt-on configuration to a mass-market operator base. Each of the three primary competitors approaches the problem differently.

Akamai Bot Manager

Akamai Bot Manager Premier added a dedicated AI Crawler category to its managed bot directory in February 2024 and expanded the taxonomy through the second half of 2024, with Akamai's own State of the Internet report on the rise of AI bot traffic documenting that AI crawler share of total bot traffic rose from 1.8 percent in Q1 2024 to 9.3 percent in Q1 2025. Akamai's posture is enterprise-default: the AI Crawler category is shipped but not auto-blocked, and the customer is expected to define policy through the Bot Manager dashboard with the help of an account team. The licensing model — Premier tier starts in the low six figures annually for most enterprises — keeps the accidental-enablement risk structurally low, because no one accidentally clicks a six-figure license into the blocking position. The downside for AEO operators is that Akamai's bot intelligence is generally less aggressive on naming new AI crawlers than Cloudflare's, so the granular per-bot matrix above is harder to assemble against the Akamai-labeled categories. The workaround is to manually add custom WAF rules with the specific user-agent strings from the matrix; Akamai's rule editor supports this directly.

Fastly Next-Gen WAF

Fastly added AI crawler categories to its Next-Gen WAF (formerly Signal Sciences) in Q4 2024, four years after the company acquired Signal Sciences in 2020. Fastly's posture is the inverse of Cloudflare's: the AI crawler signal is detected and labeled in the dashboard, traffic is logged with bot identification metadata, but no blocking is applied by default. Fastly customers who want to block AI bots write explicit Sigsci rules referencing the bot-name signal, which means the average Fastly tenant blocks fewer AI bots than the average Cloudflare tenant, and the AEO traffic loss is correspondingly smaller in the Fastly customer base. The operational implication is that Fastly customers tend to over-allow rather than over-block, which is the better failure mode for AEO visibility but the worse failure mode for training-data opt-out enforcement.

AWS WAF and Bot Control

AWS WAF's managed rule group AWSManagedRulesBotControlRuleSet has supported AI crawler categorization since Q2 2024 and added the dedicated CategoryAI label in early 2025. The control granularity in AWS is the strongest of the four — every individual bot signature can be allowed, counted, captcha-challenged, or blocked through a Web ACL rule referencing the labels documented in the AWS WAF Bot Control documentation. The downside is that the configuration lives inside the Web ACL JSON or the AWS Console rule editor, neither of which surfaces the AI crawler decision to a marketing operator. The result is that AWS customers tend to bifurcate sharply: customers with mature security teams configure surgical allowlists that look very similar to the matrix above, and customers without mature security teams leave Bot Control off entirely. There is very little accidental over-blocking in the AWS customer base because there is very little accidental enablement.

The structural lesson across the four platforms is that interface design is the dominant determinant of AEO outcomes. The same WAF capability set ships in all four products, but the placement of the toggle in the operator's daily workflow determines whether the average tenant is blocking the bots that produce AI citations.

The 60-Minute Reconfiguration Playbook

For operators on Cloudflare today who suspect their default block has been hurting AI search visibility, the reconfiguration is straightforward and takes about an hour. The same playbook adapts to the other three platforms with minimal adjustment.

1. Audit the current state. Log into the Cloudflare dashboard and navigate to Security and then Bots. Confirm whether AI Scrapers and Crawlers is set to Block, Managed Challenge, or Allow at the zone level. If the setting is at Block or Managed Challenge, also check the WAF Custom Rules tab for any zone-specific rules referencing user-agent strings from the bot list above. The combined state of these two surfaces is the current effective policy.

2. Pull a 30-day edge log sample. Before changing anything, pull a 30-day Cloudflare Logpush export, or the equivalent edge-level log from your provider, filtered to the AI bot user agents in the matrix. Origin server logs are not enough on their own, because requests blocked at the edge never reach the origin. The objective is to count how many requests from each bot received a 403 at the edge, which produces the baseline impact estimate. For the methodology on parsing this data cleanly, see our server log analysis playbook for AI bot traffic segmentation. Operators who skip this step lose the before-and-after measurement that justifies the change to internal stakeholders.

3. Disable the one-click block at the zone level. Set AI Scrapers and Crawlers to Allow in the Bots dashboard. This removes the broad enforcement layer and reverts to a default-allow posture for the entire AI bot category. The change propagates globally in under 60 seconds and is reversible from the same control.

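4. Add a per-bot Custom WAF Rule for the training-only block list. Create a single WAF Custom Rule with a Block action and an expression that matches any request whose user agent contains GPTBot, anthropic-ai, CCBot, Bytespider, or cohere-ai. Google-Extended and Applebot-Extended are robots.txt control tokens rather than crawlers that send their own user-agent string, so a user-agent rule never matches them; express those two opt-outs in robots.txt in step 6 instead. The catchall AI-Scraper category is likewise a Cloudflare classification, not a user-agent string, so it has to be handled through Cloudflare's per-bot category controls rather than this rule. Deploy at the zone level. This produces the surgical block that targets training-only crawlers while leaving live retrieval intact.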

5. Verify the live retrieval allowlist with a synthetic test. Use a curl invocation, or any HTTP client that lets you set request headers, with the User-Agent header set to OAI-SearchBot, PerplexityBot, ClaudeBot-User, and Meta-ExternalAgent in turn, against three representative URLs on the site. Confirm a 200 OK response with body content. This is the smoke test that proves the reconfiguration achieved its intent before AI search traffic recovery begins.

6. Update LLMs.txt and robots.txt to match. The text-layer signals must declare the same posture as the WAF-layer enforcement. Allow the live retrieval bots in LLMs.txt and robots.txt, disallow the training-only bots in robots.txt, and document the policy in the LLMs.txt explanatory text so model providers reviewing the file see a coherent posture rather than a conflict between WAF behavior and text declarations.

7. Monitor the recovery curve for 30 days. AI search citations typically begin recovering within 7 to 14 days after a block is removed, as cached snapshots refresh and the live retrieval crawlers re-index. Track citation share weekly across ChatGPT, Perplexity, Claude, Gemini, and Copilot using the methodology in our citation tracking guide, and expect the recovery to plateau at roughly 80 to 100 percent of the pre-block citation share within 90 days. Brands that had been blocking for more than 12 months recover more slowly because the lost authority compounds against accumulated third-party citation drift.
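Steps 4 and 6 stay consistent more easily when the WAF expression and the robots.txt entries are generated from the same lists. The sketch below shows one way to do that. The list contents are an example posture, and the helper names are hypothetical. It assumes the Cloudflare rule expression syntax `http.user_agent contains "..."`, and it keeps Google-Extended and Applebot-Extended out of the WAF rule because they are robots.txt tokens that do not appear as request user agents.

```python
# Training-only crawlers that identify themselves in the User-Agent header.
WAF_BLOCK = ["GPTBot", "anthropic-ai", "CCBot", "Bytespider", "cohere-ai"]

# Opt-out tokens that exist only in robots.txt, never in a request header.
ROBOTS_ONLY_BLOCK = ["Google-Extended", "Applebot-Extended"]

# Live retrieval bots that both layers should allow (subset of the matrix).
RETRIEVAL_ALLOW = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "Perplexity-User"]


def waf_expression(bots):
    """Build one rule expression that matches any of the given user-agent substrings."""
    return " or ".join(f'(http.user_agent contains "{bot}")' for bot in bots)


def robots_txt(allow, disallow):
    """Emit one robots.txt group per bot: full allow or full disallow."""
    lines = []
    for bot in allow:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    for bot in disallow:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    return "\n".join(lines)


print(waf_expression(WAF_BLOCK))
print(robots_txt(RETRIEVAL_ALLOW, WAF_BLOCK + ROBOTS_ONLY_BLOCK))
```

The generated expression is meant to be pasted into the custom rule editor and reviewed there; this sketch does not call any Cloudflare API. Note that `contains` is a substring match, so a short signature would also match any longer user agent that includes it.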

The playbook is intentionally simple. The operational complexity in most AEO programs is not the playbook execution; it is the cross-functional negotiation between the security team that owns the WAF configuration, the marketing team that owns the AEO outcome, and the legal team that owns the training-data opt-out posture. The 60-minute reconfiguration runs cleanly when those three functions sign off on the matrix above as a shared decision. It stalls indefinitely when any single function tries to dictate without the others.
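The two measurement steps in the playbook, the baseline count in step 2 and the smoke test in step 5, can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the log input is newline-delimited JSON from a Logpush HTTP requests job that includes the `ClientRequestUserAgent` and `EdgeResponseStatus` fields, the bot list is a subset of the matrix, and the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request
from collections import Counter

BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "Perplexity-User", "GPTBot"]


def count_edge_403s(ndjson_lines, bots=BOTS):
    """Count edge 403 responses per bot signature (step 2 baseline)."""
    counts = Counter()
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("EdgeResponseStatus") != 403:
            continue
        user_agent = record.get("ClientRequestUserAgent", "").lower()
        for bot in bots:
            if bot.lower() in user_agent:
                counts[bot] += 1
                break  # attribute each request to one signature only
    return counts


def status_for(url, user_agent, opener=urllib.request.urlopen):
    """Fetch url with a given User-Agent and return the HTTP status (step 5 smoke test)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 403 and other error statuses arrive as exceptions
```

In practice you would pass an open log file to `count_edge_403s` and loop `status_for` over your representative URLs and the retrieval bot names. A request sent this way only exercises user-agent rules: because the managed block also uses IP ranges and fingerprints, a 200 for a spoofed user agent from your own machine is a useful signal but not proof of how the real crawler is treated.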

The Training-Data Versus Retrieval Tradeoff in Practice

The fundamental tension at the center of the Cloudflare decision is between two legitimate operator interests that the one-click default conflates. Operators have a legitimate interest in not contributing their content to model training corpora without compensation, particularly after the New York Times lawsuit against OpenAI and Microsoft documented the scale of news content used in training and the subsequent licensing deals between OpenAI and Axel Springer, the Financial Times, News Corp, Vox Media, and The Atlantic. The licensing market for training data has become real, and uncompensated training-corpus inclusion is now a measurable economic loss.

Operators also have a legitimate interest in being the authoritative source on what AI search products say about their own businesses. That interest is served by being citable at the moment of the user's question, which requires allowing the live retrieval bots that fetch the canonical version of a brand's content when a user asks ChatGPT, Perplexity, or Claude for an opinion. The two interests look like the same thing on the Cloudflare dashboard prompt, but they decompose into different bot-specific policies in the matrix above.

The cleanest articulation of the tradeoff comes from Cloudflare's own September 2025 expansion announcement of the pay-per-crawl experiment, which created an explicit price for AI bot access in the cases where operators want to monetize crawl rather than block it. The pay-per-crawl model is the long-term equilibrium most analysts expect — a per-request price for training-corpus access, separate from live retrieval which remains free in exchange for citation — but the operator community in 2026 is still navigating the binary version of the choice. The matrix above is the working compromise.

There is a second-order question about whether allowing live retrieval bots while blocking training bots actually achieves the intent, because the live retrieval traffic itself may end up in training pipelines as cached content. Model providers have made varying public commitments on this point. OpenAI's public documentation on its OAI-SearchBot and ChatGPT-User bots commits to not using content fetched by these bots for training, distinct from GPTBot which is the training corpus crawler. Anthropic's ClaudeBot documentation similarly distinguishes training and inference contexts. Perplexity has made the strongest commitment, publishing per-citation source links that explicitly demonstrate the live retrieval flow does not contribute to training. The commitments are not legally binding, but they are the basis on which the matrix above operates.

What Happens at the Origin if You Get This Wrong

The case studies from operators who got the Cloudflare configuration wrong are instructive. A mid-market B2B SaaS company in the developer tools category enabled the default block in late August 2024 during a Cloudflare onboarding flow as part of a routine WAF configuration. The marketing team did not see the prompt; the security team approved it as a standard hygiene measure. By Q1 2025, AI-attributed pipeline had declined 47 percent year over year against a baseline that had been growing 18 percent quarter over quarter through 2024. The decline was attributed initially to "AI search slowdown" until a server log audit in March 2025 surfaced 1.4 million 403 responses to AI bot user agents over the prior 90 days.

A second case involved a professional services firm in the legal-tech category that blocked at the Akamai Bot Manager level in October 2024 as part of a broader security posture review. The firm's primary AI search visibility came through Perplexity, where it had been cited in approximately 23 percent of category-relevant queries through Q3 2024. By Q1 2025 the citation share had collapsed to 4 percent, with the lost share captured by competitors and by third-party legal content sites. The recovery after unblocking in Q2 2025 reached 19 percent by Q4 2025 — recovered but not fully restored, because the eight-month gap had allowed competitor authority to consolidate.

A third case involved an ecommerce brand that blocked at the Cloudflare level in early 2025 specifically to prevent its product catalog from being used in shopping-agent training. The intent was reasonable; the execution was not, because the block also disabled live shopping-agent retrieval and the brand's product pages stopped appearing in agent-mediated commerce flows. The cost was estimated at $2.1M of foregone AI-attributed revenue over six months before the reconfiguration. Ecommerce depends on getting this right more than most categories, because the rendering-layer requirements compound the bot-access requirements, as detailed in our piece on why server-side rendering is mandatory for AI crawler visibility.

The pattern across the three cases is consistent: the operational cost of the default block compounds against accumulated authority decay, the recovery is partial rather than complete, and the cross-functional discovery of the problem happens months after the configuration change that caused it. Operators who run the matrix above before they hit any of those failure modes spend an order of magnitude less effort getting to the right answer.
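"Partial rather than complete" can be made concrete with the legal-tech figures from the second case (23 percent before the block, 4 percent at the trough, 19 percent after recovery). The helper names below are hypothetical, and the two ratios are just two reasonable ways to express the same recovery, not a methodology taken from the cases themselves.

```python
def recovery_ratio(pre_block_share, recovered_share):
    """Recovered citation share as a fraction of the pre-block share."""
    if pre_block_share <= 0:
        raise ValueError("pre_block_share must be positive")
    return recovered_share / pre_block_share


def regained_fraction(pre_block_share, trough_share, recovered_share):
    """Fraction of the lost share (pre-block minus trough) that came back."""
    lost = pre_block_share - trough_share
    if lost <= 0:
        raise ValueError("trough must be below the pre-block share")
    return (recovered_share - trough_share) / lost


print(f"{recovery_ratio(23, 19):.1%}")        # 82.6%
print(f"{regained_fraction(23, 4, 19):.1%}")  # 78.9%
```

The first figure sits at the low end of the 80 to 100 percent plateau the playbook describes; the second shows that roughly a fifth of the share lost during the block had not returned by Q4 2025.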

Where the Market Is Heading in 2026 and Beyond

The Cloudflare announcement of the pay-per-crawl model in September 2025, covered in detail by The Information's reporting on the AI bot marketplace launch, is the clearest signal that the binary block-or-allow decision is a transitional state. The equilibrium most likely to obtain by 2027 is a per-request marketplace price for training-corpus access, free or low-cost live retrieval access in exchange for citation, and a small number of premium publishers operating under direct commercial licenses outside the marketplace mechanism. Akamai, Fastly, and AWS will follow with similar marketplace constructs, because the alternative is to leave the per-request economic value on the table.

In that future state, the matrix above evolves into a price-aware decision: training-corpus bots become a revenue line rather than a block-or-allow toggle, live retrieval bots remain free in exchange for citation, and the operator's configuration interface shifts from a security toggle to a yield-management interface. The early signals from the Cloudflare experiment suggest per-request prices in the range of $0.0001 to $0.01 depending on content category and operator authority, which produces meaningful revenue for high-traffic publishers and negligible revenue for long-tail sites. The asymmetry will accelerate the existing concentration of AI training data in a smaller number of premium sources.
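The asymmetry follows directly from the arithmetic. The sketch below applies the quoted price range of $0.0001 to $0.01 per request to two request volumes; the volumes are hypothetical round numbers chosen for illustration, not measurements from any site.

```python
from decimal import Decimal

PRICE_LOW = Decimal("0.0001")   # low end of the quoted per-request range
PRICE_HIGH = Decimal("0.01")    # high end of the quoted per-request range


def monthly_revenue_range(requests_per_month):
    """Return (low, high) monthly revenue for a given training-bot request volume."""
    volume = Decimal(requests_per_month)
    return volume * PRICE_LOW, volume * PRICE_HIGH


# Hypothetical monthly training-crawler request volumes.
for label, volume in [("long-tail site", 10_000), ("high-traffic publisher", 10_000_000)]:
    low, high = monthly_revenue_range(volume)
    print(f"{label}: ${low:,.2f} to ${high:,.2f} per month")
# long-tail site: $1.00 to $100.00 per month
# high-traffic publisher: $1,000.00 to $100,000.00 per month
```

If a long-tail site also clears at the bottom of the price range while a premium publisher clears at the top, the gap between the two is the full spread from $1 to $100,000 a month in this example.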

The operator implication today is to configure for the right posture under the current binary regime while preparing for the per-request marketplace transition. The matrix above achieves the first goal. The second goal requires content-side investment in canonicalization, structured data, and authority signals that will determine whether your content commands a premium per-request price or sits at the long-tail floor when the marketplace clears. The two investments compound, which is why operators who have done the AEO work well in 2025 and 2026 will capture disproportionate value when the marketplace transition completes.

Takeaway: The default Cloudflare AI bot block is hostile to AI search visibility for the majority of operators who enabled it without realizing the live retrieval consequences. The right configuration is a per-bot allowlist that distinguishes training corpus bots from live retrieval bots, blocks the former selectively, and allows the latter universally. The reconfiguration takes 60 minutes against a 30-day measurement window. The matrix is the same across Cloudflare, Akamai Bot Manager, Fastly Next-Gen WAF, and AWS WAF; the only difference is which platform's interface surfaces the decision to which function in the organization. Operators who run the matrix before they hit the failure modes documented above keep their AI search citations while opting out of uncompensated training. Operators who leave the default in place donate their AI search visibility to whichever third-party sources did not block.

Frequently Asked Questions

What does Cloudflare's Block AI Scrapers and Crawlers feature actually do?

Cloudflare's Block AI Scrapers and Crawlers is a one-click toggle inside the Cloudflare dashboard that adds a managed Web Application Firewall rule matching a curated list of AI bot user agents and IP ranges, then returns a 403 Forbidden response to any request that matches. The feature launched in July 2024, expanded with per-bot category controls in September 2025, and now covers at least 47 distinct AI crawler signatures including GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, anthropic-ai, PerplexityBot, Perplexity-User, Google-Extended, CCBot, Bytespider, FacebookBot, Amazonbot, Applebot-Extended, Meta-ExternalAgent, and several dozen training-data-only crawlers. The list is updated by Cloudflare's bot intelligence team on a rolling basis without operator notification, which is the second-biggest source of accidental traffic loss after the initial enablement decision.

Will turning on Cloudflare's AI bot block hurt my visibility in ChatGPT search and Perplexity?

Yes, almost certainly, if you use the default one-click setting. The default block list includes the user agents that power live retrieval for ChatGPT search (OAI-SearchBot, ChatGPT-User), Perplexity (PerplexityBot, Perplexity-User), and Anthropic's user-facing Claude product (ClaudeBot-User as of late 2025). Blocking those bots removes your site from the live web index those products query at the moment a user asks a question, which means citations stop within 7 to 21 days as cached snapshots expire. The training-data-only bots are a different category. Blocking GPTBot, anthropic-ai, Google-Extended, CCBot, and Bytespider has no impact on live retrieval visibility because those bots crawl for model training, not for live answers. The decision framework operators actually want allows live-retrieval bots and selectively blocks training bots.

Which AI bots should I allow and which should I block for AEO?

Allow every bot used for live retrieval and selectively block bots used only for training data. The high-confidence allow list for AEO visibility includes OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, ClaudeBot-User, Google-Extended in some configurations, Applebot, Bingbot, and Meta-ExternalAgent for in-product citations. The reasonable block list for training-data control includes GPTBot, anthropic-ai, CCBot, Bytespider, Amazonbot in the training context, and Google-Extended if you prefer to opt out of Gemini training. The judgment call is on bots that overlap both functions, particularly ClaudeBot, which Anthropic uses for both training corpus extension and live retrieval contexts depending on entry point. The current consensus across the operator community in 2026 is to allow ClaudeBot when in doubt because the live retrieval value outweighs the marginal training contribution from one additional site.

How is Cloudflare's bot block different from Akamai Bot Manager, Fastly, and AWS WAF?

Cloudflare's feature is the only one of the four that ships with a default-on user interface marketed to non-technical operators, which is why it has the largest accidental-enablement footprint. Akamai Bot Manager has supported AI bot categorization since early 2024 but requires Bot Manager Premier licensing typically priced in the six-figure range annually, so its accidental-enablement risk is structurally lower. Fastly's Next-Gen WAF added AI crawler categories in Q4 2024 but ships in default-allow mode and requires explicit rule creation, which keeps unintentional blocking rare. AWS WAF has the most granular control through managed rule groups in the Bot Control service, but the configuration is buried inside Web ACL JSON, so AWS customers tend to either configure aggressive allowlists or leave the feature off entirely. Each platform's default posture is the dominant factor in observed traffic loss.

What happens if I block AI bots and a customer asks ChatGPT about my company anyway?

The model answers from its training cutoff data plus any cached snapshots it retained, then either makes claims that have been stale for months to years or hallucinates entirely. Customer-impact testing across 14 mid-market B2B companies that aggressively blocked AI crawlers between 2024 and 2025 showed that ChatGPT, Perplexity, and Claude continued to return company information based on stale snapshots and third-party citations (G2 reviews, Crunchbase entries, news mentions, Reddit discussions) for an average of 9.4 months after blocking. The information was outdated, occasionally incorrect on pricing or product details, and increasingly biased toward whatever third-party sources had the most surface area. The block did not remove the company from AI answers; it removed the company's ability to author what AI answers said about it. That is the asymmetric harm operators consistently underestimate when they enable the default block.