The 2026 CDN Edge Cache Strategy for AI Crawlers

AI bot traffic hit 30 to 40 percent of edge requests at major publisher properties by Q1 2026. The CDN configuration you ship in the next 90 days decides whether those bots are nearly free or burn six figures of origin bandwidth.

By Raj Patel, AI & Infrastructure · May 25, 2026 · 18 min read

In February 2026, Cloudflare's AI Bot Traffic report put a number on a problem that infrastructure teams had been arguing about all year. Across the Cloudflare network, identified AI crawler traffic accounted for a median of 31 percent of edge requests at publisher properties, with the 90th percentile hitting 42 percent. GPTBot alone was responsible for 7.2 percent of all requests on the network. ClaudeBot had grown 340 percent year over year. PerplexityBot was the fastest-growing identified crawler on the platform. The report's conclusion was unambiguous: AI bot traffic is now a first-class load on the modern web, and most properties are paying for it at the same rate as human traffic without realizing it.

The economic implication is the part that nobody on the marketing or content side wants to hear. A median publisher with 100 million monthly origin requests is now serving 30-plus million of those to AI crawlers, and the typical CDN configuration treats those requests identically to human ones — same TTL, same revalidation logic, same origin egress charges. For a property paying around 4 cents per gigabyte of CDN bandwidth and another 8 to 12 cents per gigabyte of origin egress to a major cloud provider, the AI crawler share alone can be a five to six figure annual line item. And it is growing every quarter as the crawlers get hungrier.

The good news is that none of this is fixed cost. AI crawlers are unusually amenable to edge caching because they tolerate stale content, they re-crawl on predictable cadences, and they fetch the same canonical URLs that human users fetch. With the right edge configuration, you can serve identified AI bots at near-zero origin cost while preserving their access to your fresh content. The properties that have done this — major publishers, large SaaS documentation sites, and ecommerce platforms running on Cloudflare or Fastly — are seeing origin egress reductions of 60 to 85 percent on crawler traffic while increasing their AI citation visibility. The properties that have not have either blocked the bots (losing the citation upside) or kept paying premium origin costs to serve cacheable content over and over again.

This piece is the 2026 CDN edge cache strategy for AI crawlers. It covers what the major bots actually fetch, how to differentiate cache rules by crawler class, how to use edge KV stores for crawl tracking without burdening origin, the specific Cloudflare AI Audit, Fastly Compute@Edge, and Akamai EdgeWorkers patterns to implement, and how to handle the evergreen-versus-news distinction with stale-while-revalidate. It is meant for site reliability engineers, infrastructure architects, and the head of platform who is going to be asked to justify the next CDN bill.

Why AI Crawler Traffic Is Different From Search Crawler Traffic

Googlebot, Bingbot, and the previous generation of search crawlers established a set of patterns that most CDN configurations were designed around. They crawl on a budget allocated per property, they respect robots.txt aggressively, they emit predictable user agents, they execute JavaScript for indexing purposes, and they tend to fetch a representative sample of URLs rather than exhaustively re-fetching the same content. The modern AI crawler population (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, and the rest) does not behave this way, and treating it like classic search traffic is what leads to the cost surprises.

There are five structural differences that matter for cache strategy.

Higher fetch frequency on a narrower URL set. AI crawlers fetch the same canonical URLs much more frequently than search crawlers do. The reason is that they serve two distinct workloads: bulk training corpus refresh (low frequency, broad coverage) and real-time retrieval augmentation for live AI answers (high frequency, narrow on whatever the user just asked about). Both workloads concentrate on the highest-value URLs on your property (homepages, top categories, recent articles), and the retrieval workload re-fetches them on cadences as short as every few minutes during topic spikes. A news article that goes viral in AI search can be fetched by retrieval crawlers such as PerplexityBot dozens to hundreds of times per hour at peak.

Lower JavaScript execution rate. Per the OpenAI GPTBot documentation, GPTBot does not execute JavaScript. ClaudeBot does not. PerplexityBot's primary user agent does not, although the Perplexity browse-assistant variant sometimes does. Google-Extended inherits Googlebot's rendering pipeline. The implication is that the responses these bots fetch should be fully rendered HTML, which means your edge cache should be caching the rendered HTML rather than just the API responses or an unrendered shell. This is a separate architectural concern that we cover in our piece on why server-side rendering is now mandatory for AI crawler visibility, but it interacts with cache strategy because the size of the cached object and the cost of a miss are both higher when the cached unit is fully rendered HTML.

Tolerance for stale content. AI crawlers are notably more tolerant of stale content than Googlebot is. Googlebot penalizes properties for serving outdated or inconsistent responses because freshness is a ranking signal. AI training crawlers do not care if the response is 24 hours old, and AI retrieval crawlers cache the response on their end for the duration of the user query anyway. This tolerance is the operational lever that makes stale-while-revalidate so effective for crawler-facing cache strategy.

Predictable but uneven crawl scheduling. Each major AI crawler has a characteristic crawl pattern that you can identify in your logs. GPTBot tends to do bulk crawls in 4 to 6 hour bursts followed by quiet periods. ClaudeBot crawls more uniformly across the day with a slight US business hours peak. PerplexityBot is the spikiest of the major crawlers, with traffic correlated to user-driven search load. Knowing the pattern lets you pre-warm origin and shape cache rules to absorb the spikes at the edge.

Sensitivity to user-agent identity. Identified AI crawlers send distinctive user agents. The unidentified ones send everything from Chrome desktop strings to mobile Safari, often through residential proxy networks. The cache strategy that works for the identified population (longer TTLs, edge-served, friendly) is the opposite of the strategy that works for the unidentified population (rate limiting, challenge pages, suspicious-by-default). Conflating the two is the most common mistake we see in CDN configurations from 2024 and early 2025.

Taken together, these properties mean that AI crawler traffic should be its own configuration tier in your CDN, not a special case of human traffic. The properties that have made this shift are the ones running near-zero-cost crawler serving in 2026.

The Crawler Bandwidth Math Most Properties Are Getting Wrong

Before getting into specific configurations, it is worth establishing the order of magnitude of the bandwidth and cost involved. The properties that under-invest in this typically do so because they believe AI crawler traffic is too small to matter. By 2026, that intuition is consistently wrong.

A useful baseline is the Cloudflare AI Bot Traffic report referenced earlier and the Fastly Edge Cloud Network State report published in March 2026. Pulling the comparable numbers from both:

Metric	Cloudflare Network (Q1 2026)	Fastly Network (Q1 2026)
Median identified AI bot share of total requests	31%	27%
90th percentile AI bot share	42%	38%
GPTBot share of total traffic	7.2%	6.4%
ClaudeBot share of total traffic	4.1%	3.8%
PerplexityBot share of total traffic	3.6%	4.0%
Google-Extended share of total traffic	2.8%	2.5%
Year-over-year growth in identified AI bot traffic	+217%	+194%
Median publisher origin egress saved by edge caching	67%	61%

For a property doing 100 million monthly origin requests at typical CDN economics, the cost difference between treating AI crawler traffic as cacheable and treating it as origin-served is substantial. Assuming the median 31 percent crawler share and a per-request origin cost of 0.0008 cents (a realistic blended cost including compute, egress, and database load for a content site), the no-edge-caching baseline is roughly 31 million monthly requests at origin from crawlers, or about $248 per month in pure crawler-driven origin cost. That number sounds small. It is not.

The 31 percent share is the median, and high-value publishers are at 42 percent or higher. Origin cost per request is materially higher for properties with heavy database hits or personalization logic, often 0.003 to 0.01 cents per request. The largest publishers we have profiled are spending $40,000 to $180,000 per year on origin egress alone serving AI crawlers, and they did not know it until they pulled the crawler-specific cost breakdown out of their CDN logs. The same publishers, after applying the edge cache patterns covered below, are spending under $5,000 per year on the same crawler population while serving the same content with the same freshness guarantees.
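The arithmetic behind these figures is easy to reproduce for your own property. The following is a minimal Python sketch that uses only the illustrative numbers quoted above (100 million monthly requests, 31 and 42 percent crawler shares, 0.0008 and 0.01 cents per request). The function name is hypothetical, and the inputs should be replaced with values from your own logs and invoices.

```python
# Worked version of the crawler origin-cost arithmetic from this section.
# All inputs are the illustrative figures quoted in the text, not measurements.

def monthly_crawler_origin_cost_usd(
    monthly_requests: int,
    crawler_share: float,
    cost_per_request_cents: float,
) -> float:
    """Monthly origin cost in USD of crawler requests with no edge caching."""
    crawler_requests = monthly_requests * crawler_share
    return crawler_requests * cost_per_request_cents / 100  # cents to dollars


# Median crawler share with a cheap origin.
baseline = monthly_crawler_origin_cost_usd(100_000_000, 0.31, 0.0008)
# 90th percentile share with an expensive, database-heavy origin.
heavy = monthly_crawler_origin_cost_usd(100_000_000, 0.42, 0.01)

print(f"median share, cheap origin: ${baseline:,.0f} per month")
print(f"p90 share, expensive origin: ${heavy:,.0f} per month, ${heavy * 12:,.0f} per year")
```

With the same 100 million requests, moving from the median share and the low per-request cost to the 90th percentile share and the upper per-request cost takes the figure from about $248 per month to about $4,200 per month, or roughly $50,000 per year.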

The opportunity is not just cost. It is also the inverse: when origin-served crawler traffic becomes expensive enough that operators start blocking crawlers to manage cost, they end up removing themselves from the AI citation set in a way that costs them far more in lost referral traffic than the bandwidth ever saved. This is the worst-case outcome we see and it is preventable with edge cache strategy alone.

Differentiated Cache Rules by Crawler Class

The core architectural pattern is to identify the requesting crawler at the edge, apply a cache rule appropriate to that crawler's behavior, and route the response without ever touching origin if a fresh-enough cached copy exists. Every major CDN supports this; the implementation details differ.

The bot-class taxonomy that matters in 2026 is roughly four tiers:

Tier 1: Identified training and retrieval crawlers. This is GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended, Applebot-Extended, anthropic-ai, ChatGPT-User, and a handful of smaller branded crawlers. These send distinct, documented user agents and originate from published IP ranges that the major CDNs all maintain bot intelligence lists for. They are the population you want to serve aggressively from edge cache. Default policy: long TTL, stale-while-revalidate, no challenge.

Tier 2: Search and discovery crawlers. Googlebot, Bingbot, DuckDuckGo, and similar classic search agents. These behave well, respect robots.txt, and benefit from edge caching as long as you do not over-cache content that needs to be re-indexed for ranking signals. Default policy: moderate TTL, no challenge, careful revalidation rules to preserve search ranking freshness.

Tier 3: Verified researcher and archive crawlers. Common Crawl, Internet Archive's archive.org_bot, academic crawlers from major universities. These are useful long-term contributors to training corpora and you generally want them to succeed, though they are not as high-frequency as the Tier 1 crawlers. Default policy: long TTL, edge-served, rate-limited to prevent batch crawl spikes from overloading origin.

Tier 4: Unidentified and suspicious crawlers. Everything else that looks bot-like but does not identify itself, including residential-proxy scrapers and adversarial crawlers. Default policy: challenge or rate limit, do not serve from cache by default, log aggressively to inform allow/deny decisions.
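The four tiers can be expressed as a small classification function. What follows is a sketch in Python rather than any CDN's rule language. The token list is an illustrative subset of the crawlers named above, and the `ip_verified` flag is assumed to come from a separate check against each crawler's published IP ranges. The function also assumes the request has already been flagged as automated, since separating bots from human browsers is a different problem.

```python
from enum import Enum


class Tier(Enum):
    AI_CRAWLER = 1    # identified training and retrieval crawlers
    SEARCH = 2        # classic search and discovery crawlers
    ARCHIVE = 3       # researcher and archive crawlers
    UNIDENTIFIED = 4  # everything else that looks automated


# Lowercase user-agent tokens for crawlers named in the text. This is an
# illustrative subset, not a complete or authoritative list.
TOKENS = [
    (Tier.AI_CRAWLER, ("gptbot", "claudebot", "perplexitybot",
                       "oai-searchbot", "chatgpt-user", "anthropic-ai")),
    (Tier.SEARCH, ("googlebot", "bingbot", "duckduckbot")),
    (Tier.ARCHIVE, ("ccbot", "archive.org_bot")),
]


def classify(user_agent: str, ip_verified: bool) -> Tier:
    """Map an automated request to one of the four tiers.

    Assumes the request was already flagged as bot-like upstream, and that
    ip_verified says whether the source IP is in the crawler's published range.
    """
    ua = user_agent.lower()
    for tier, tokens in TOKENS:
        if any(token in ua for token in tokens):
            # A known token arriving from an unverified IP is treated as spoofed.
            return tier if ip_verified else Tier.UNIDENTIFIED
    return Tier.UNIDENTIFIED
```

Treating a recognized token from an unverified address as Tier 4 is the design choice that keeps spoofed user agents from inheriting the friendly Tier 1 policy.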

Within Tier 1, you should also differentiate by crawler purpose. Training crawlers (GPTBot's bulk crawler, ClaudeBot) tolerate the longest TTLs because their consumption is asynchronous; the same holds for Tier 3 archive crawlers such as Common Crawl. Retrieval crawlers (PerplexityBot during live searches, OAI-SearchBot serving ChatGPT browsing, ChatGPT-User on behalf of a specific user query) want fresher content because the user is waiting. A practical pattern is to apply a 7-day cache TTL with a 14-day stale-while-revalidate window to training-tier user agents and a 4-hour cache TTL with a 24-hour stale-while-revalidate window to retrieval-tier user agents. This balances freshness against origin protection appropriately for each workload.
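The two policies translate directly into Cache-Control values. The sketch below is one way to encode them in Python. The mapping of individual crawlers to a purpose follows the examples in the paragraph above, and defaulting unknown Tier 1 crawlers to the shorter retrieval policy is an assumption made here for safety, not something the platforms prescribe.

```python
from dataclasses import dataclass

HOUR = 3600
DAY = 24 * HOUR


@dataclass(frozen=True)
class CachePolicy:
    ttl_seconds: int
    swr_seconds: int

    def header(self) -> str:
        """Render the policy as a Cache-Control header value."""
        return (f"public, max-age={self.ttl_seconds}, "
                f"stale-while-revalidate={self.swr_seconds}")


# TTL and stale-while-revalidate windows from the text.
POLICIES = {
    "training": CachePolicy(7 * DAY, 14 * DAY),
    "retrieval": CachePolicy(4 * HOUR, 24 * HOUR),
}

# Purpose per crawler token, following the examples in the text.
PURPOSE_BY_TOKEN = {
    "gptbot": "training",
    "claudebot": "training",
    "perplexitybot": "retrieval",
    "oai-searchbot": "retrieval",
    "chatgpt-user": "retrieval",
}


def cache_control_for(crawler_token: str) -> str:
    # Assumption: unknown Tier 1 crawlers get the shorter retrieval policy.
    purpose = PURPOSE_BY_TOKEN.get(crawler_token.lower(), "retrieval")
    return POLICIES[purpose].header()


print(cache_control_for("GPTBot"))
print(cache_control_for("PerplexityBot"))
```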

The content axis also matters. Evergreen content (definition pages, glossary entries, reference documentation, archived articles) can be cached extremely aggressively for crawlers with TTLs in the days-to-weeks range. News and time-sensitive content needs shorter TTLs with aggressive stale-while-revalidate so crawlers always get a fresh-enough copy without hammering origin. This dimension is covered in more depth in our piece on the dynamic content cache and personalization tradeoff, but the crawler-facing pattern is simpler than the user-facing pattern because crawlers do not need personalization. You can apply the most aggressive caching to crawler traffic and let the personalization logic fire only for non-crawler requests.

Cloudflare AI Audit and Workers KV Patterns

Cloudflare's AI Audit feature, launched in late 2024 and expanded through 2025, provides per-crawler dashboards, granular access controls, and the underlying bot intelligence to drive cache rules. As of the May 2026 release, it identifies 22 named AI crawlers and supports per-crawler rule sets including cache TTL overrides, rate limits, and content access controls. The Cloudflare AI Audit documentation is the canonical reference, but the practical operating patterns are worth covering.

The recommended Cloudflare configuration for AI crawler optimization has three components.

Cache Rules with bot classification predicates. Use the Cache Rules engine to match the cf.bot_management.verified_bot and cf.client.bot fields, paired with cf.bot_management.score thresholds. For verified AI crawlers, set Edge TTL to the crawler-class default (7 days for training, 4 hours for retrieval) and enable the stale-while-revalidate flag with the appropriate window. Critically, set Cache Key to exclude the User-Agent so that crawler responses are cached against a shared key with human responses for the same URL. This maximizes the cache hit rate by allowing crawler requests to be served from the same cached object that was populated by human traffic.

Workers KV for crawl-frequency tracking. Deploy a lightweight Worker that counts each identified crawler request, keyed on URL and crawler class. KV is eventually consistent, which is fine for analytics, but writes are metered and a single key accepts only a limited write rate, so aggregate the counts into 60-second windows and write once per window rather than once per request. Batched this way, it is a low-cost way to build a crawler analytics pipeline at the edge. Ship the windowed counters to Cloudflare Analytics Engine or your own data warehouse via Workers Logpush. The output is a per-URL, per-crawler frequency dataset you can use to identify hot URLs and tune cache rules.
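The windowing logic is independent of the platform. The following is a plain-Python sketch of it, not Workers code: `CrawlCounter` is a hypothetical name, and the `flush` callable stands in for whatever batched write your edge platform offers. Real edge runtimes do not share memory across locations or guarantee that an instance lives for a full window, so a production version would also flush on shutdown or on a timer.

```python
from collections import Counter
from typing import Callable, Dict, Optional


class CrawlCounter:
    """Aggregate crawler hits in memory and emit one batch per time window."""

    def __init__(self, flush: Callable[[Dict[str, int]], None],
                 window_seconds: int = 60) -> None:
        self._flush = flush
        self._window = window_seconds
        self._counts: Counter = Counter()
        self._window_start: Optional[int] = None

    def record(self, url: str, crawler_class: str, now: float) -> None:
        # Align the timestamp to the start of its window.
        window_start = int(now // self._window) * self._window
        if self._window_start is not None and window_start != self._window_start:
            self.flush()  # the previous window is complete
        self._window_start = window_start
        self._counts[f"{window_start}:{crawler_class}:{url}"] += 1

    def flush(self) -> None:
        if self._counts:
            self._flush(dict(self._counts))
            self._counts.clear()


fake_kv: Dict[str, int] = {}  # stand-in for an edge KV namespace
counter = CrawlCounter(flush=fake_kv.update)
counter.record("/pricing", "retrieval", now=0.0)
counter.record("/pricing", "retrieval", now=12.5)
counter.record("/pricing", "retrieval", now=75.0)  # new window, flushes the first
counter.flush()
print(fake_kv)
```

Because the window start is part of the key, each key is written once, which sidesteps read-modify-write races on an eventually consistent store.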

AI Audit access controls for the unidentified tier. Use the Bot Fight Mode or the more granular Super Bot Fight Mode to challenge unidentified bot-like traffic, and configure the rules so that identified AI crawlers (the Tier 1 population above) are explicitly allowed through. This is the configuration that most properties get wrong: they enable bot fighting broadly and accidentally challenge legitimate AI crawlers, which then either fail to fetch or back off and reduce their crawl frequency. The Cloudflare dashboard exposes a per-bot allow/deny matrix that should be reviewed monthly.

The Cloudflare pattern is the most documented and most accessible because the bot intelligence is built into the platform. For properties already on Cloudflare, this is typically a few hours of configuration work for substantial bandwidth savings.

Fastly Compute@Edge and Edge KV Patterns

Fastly's approach is more programmatic. Compute@Edge runs WebAssembly modules at the edge with full control over cache behavior, request inspection, and key-value lookups via Edge KV. The trade-off compared to Cloudflare is more flexibility and more configuration burden. Fastly's VCL bot detection documentation and the Edge KV reference are the canonical starting points.

The Fastly pattern for AI crawler optimization typically uses a single Compute@Edge service that handles five things: user-agent classification, cache key normalization, surrogate key tagging, crawl-frequency counter updates against Edge KV, and stale-while-revalidate enforcement. A representative implementation looks like this:

1. User-agent classification. Parse the User-Agent header against a configured bot taxonomy. Fastly does not ship a built-in AI bot taxonomy at the same depth as Cloudflare AI Audit, so most properties maintain their own list keyed off documented user agent strings from OpenAI, Anthropic, Perplexity, Google, and Apple. Classify the request into one of the four tiers above and stamp the classification onto a custom header for downstream rules.

2. Cache key normalization. Strip query parameters that do not affect the response body, normalize the path, and exclude User-Agent from the cache key so that crawler and human requests hit the same cached object. This is the single highest-leverage cache configuration change because it converts crawler requests into cache-hit-eligible traffic against the most-populated cache key.

3. Surrogate key tagging. Apply Surrogate-Key headers to cached responses based on content type and freshness requirement. This enables targeted cache purging when content updates — you can purge all evergreen content with one key, all news content with another, and not blow away the entire cache when a single article changes.

4. Crawl-frequency counter updates. Increment Edge KV counters keyed on URL and crawler class for each identified crawler request. Edge KV's eventual consistency model is fine for this analytical use case. Ship the aggregated counters out via Fastly's real-time logging to your analytics platform.

5. Stale-while-revalidate enforcement. Set Cache-Control headers on the response from origin to include max-age and stale-while-revalidate directives appropriate to the content type and the requesting crawler class. Web.dev documents the stale-while-revalidate pattern in detail and Fastly's implementation matches the spec precisely.
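Step 2 above can be sketched as a pure function. The version below is written in Python for readability rather than in a language Compute@Edge runs, and the list of ignored parameters and the trailing-slash rule are assumptions that depend on how your site routes URLs. User-Agent is excluded simply by never being an input to the key.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical list of query parameters that never change the response body.
IGNORED_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "fbclid", "gclid",
}


def cache_key(url: str) -> str:
    """Build a normalized cache key from a URL, independent of User-Agent."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Assumption: /guide and /guide/ serve the same body on this site.
    path = parts.path.rstrip("/") or "/"
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in IGNORED_PARAMS
    )
    query = urlencode(kept)  # sorted so parameter order does not split the cache
    return f"{host}{path}?{query}" if query else f"{host}{path}"


print(cache_key("https://Example.com/guide/?utm_source=x&b=2&a=1"))
```

Two requests that differ only in tracking parameters, parameter order, host casing, or user agent now resolve to the same cached object.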

The Fastly pattern is more flexible than the Cloudflare pattern but requires more upfront configuration work. For properties already running Fastly, the typical investment is one to two engineering weeks for the initial Compute@Edge service plus an ongoing maintenance burden as new crawlers emerge and user agents change.

Akamai EdgeWorkers and EdgeKV Patterns

Akamai's EdgeWorkers and EdgeKV provide functional parity with Cloudflare Workers and Fastly Compute@Edge, with the added advantage of Akamai's mature bot intelligence from the Bot Manager Premier product line. The Akamai pattern for AI crawler serving is similar in concept to the Fastly pattern but typically integrates more tightly with the Bot Manager classification layer.

The recommended Akamai configuration uses a single EdgeWorker that consumes the Bot Manager classification result (available as a header injected by the bot detection module), looks up the appropriate cache rule from a configuration loaded at EdgeWorker initialization, and writes counters to EdgeKV for crawl tracking. The cache rule itself is enforced via the Property Manager configuration, with the EdgeWorker overriding Cache-Control headers as needed before they reach the cache layer.

Two Akamai-specific notes are worth flagging. First, Akamai's tiered distribution architecture (parent cache and child cache layers) means that aggressive edge caching produces compounding savings: a cache hit at the child cache layer avoids both origin and parent cache traffic. Second, Akamai's API Gateway and EdgeKV pricing model rewards higher cache hit ratios in a way that compounds with the AI crawler optimization. Properties moving from a 40 percent crawler cache hit ratio to a 92 percent crawler cache hit ratio see disproportionate cost reductions because of how the platform meters cross-tier traffic.
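Independent of any platform's pricing, part of that effect is plain arithmetic: cost tracks misses, not hits. The short Python sketch below applies the 40 and 92 percent hit ratios to the 31 million monthly crawler requests used as the running example earlier. It deliberately ignores tiered-cache metering, which would have to be modeled from your own contract.

```python
def origin_fetches(crawler_requests: int, hit_ratio: float) -> int:
    """Requests that miss the edge cache and travel upstream."""
    return round(crawler_requests * (1 - hit_ratio))


before = origin_fetches(31_000_000, 0.40)
after = origin_fetches(31_000_000, 0.92)
reduction = 1 - after / before

print(f"upstream fetches before: {before:,}")
print(f"upstream fetches after:  {after:,}")
print(f"reduction: {reduction:.1%}")
```

The hit ratio rises by 52 points, but upstream fetches fall from 18.6 million to about 2.5 million, a reduction of roughly 87 percent.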

Evergreen Versus News: The Cache Lifetime Decision

Once the bot classification and edge infrastructure is in place, the remaining strategic decision is what cache lifetime to apply to which content. This is the dimension where most properties either over-cache (and break crawler freshness for news content) or under-cache (and burn origin bandwidth on evergreen content).

The pattern that consistently works for crawler traffic in 2026 is to segment content into four lifetime tiers:

Tier A: Permanent evergreen (definition pages, glossary, archived reference content). Cache TTL: 30 days. Stale-while-revalidate: 90 days. Purge: only on explicit republish events. This content essentially never needs to revalidate for crawler purposes. The longer the TTL, the more efficiently the cache serves repeated crawler hits.

Tier B: Slow-changing content (most articles after the first 72 hours, product documentation, category pages). Cache TTL: 7 days. Stale-while-revalidate: 14 days. Purge: on content update via surrogate key. This is the majority of most content properties, and the TTL choice here drives the largest share of bandwidth savings.

Tier C: News and recent content (articles within the first 72 hours, real-time dashboards, recent commentary). Cache TTL: 4 hours. Stale-while-revalidate: 24 hours. Purge: on update via surrogate key, plus a scheduled refresh on the canonical updated-at signal. This is the trickiest tier because crawler freshness expectations are highest here, but stale-while-revalidate makes it manageable.

Tier D: Personalized or session-dependent content (logged-in dashboards, A/B tested landing pages, dynamic product recommendations). Cache TTL: 0 for human users (no edge cache). For crawlers, serve a non-personalized canonical variant with Tier B caching rules. This is the pattern that lets you preserve human-facing personalization without crawler caching breaking.

This segmentation typically requires a small amount of content metadata work — you need to know which articles are in which tier — but most CMSes already have this distinction modeled. The work is usually adding a header or surrogate key that propagates the tier classification to the CDN edge.
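A sketch of how the tier classification might be turned into response headers follows. It is written in Python and makes several assumptions: the `kind` labels ("reference", "article", "personalized") are hypothetical CMS values, the 72-hour boundary is the one used in the tier definitions above, and Surrogate-Key is the Fastly header name, with other CDNs using their own header for tag-based purging.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

HOUR = 3600
DAY = 24 * HOUR

# (max-age, stale-while-revalidate) in seconds for the cacheable tiers.
TIER_TTLS = {
    "A": (30 * DAY, 90 * DAY),
    "B": (7 * DAY, 14 * DAY),
    "C": (4 * HOUR, 24 * HOUR),
}


def lifetime_tier(kind: str, published_at: datetime, now: datetime) -> str:
    """Pick a lifetime tier from a hypothetical CMS content kind and age."""
    if kind == "personalized":
        return "D"
    if kind == "reference":
        return "A"
    if now - published_at < timedelta(hours=72):
        return "C"
    return "B"


def crawler_headers(kind: str, published_at: datetime, now: datetime) -> Dict[str, str]:
    tier = lifetime_tier(kind, published_at, now)
    # Crawlers get the non-personalized variant of Tier D under Tier B rules.
    max_age, swr = TIER_TTLS["B" if tier == "D" else tier]
    return {
        "Cache-Control": f"public, max-age={max_age}, stale-while-revalidate={swr}",
        "Surrogate-Key": f"tier-{tier.lower()}",
    }


now = datetime(2026, 5, 25, tzinfo=timezone.utc)
print(crawler_headers("article", now - timedelta(hours=5), now))
print(crawler_headers("article", now - timedelta(days=10), now))
```

An article moves from Tier C to Tier B automatically once it is older than 72 hours, so the only metadata the CMS has to supply is the content kind and the publish time.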

The interaction with sitemap design matters here. The same content classification you use for cache lifetime should inform sitemap segmentation for AEO crawl priority, because crawlers use sitemap signals to decide where to spend their crawl budget. Aligning the two means the crawler asks for the URLs you have already optimized to serve cheaply, and the cache and sitemap layers compound rather than fight each other.

A 9-Step Playbook for AI Crawler Edge Optimization

Use the following sequence to bring an existing CDN configuration into the 2026 optimal state for AI crawler traffic. The order matters: each step either makes the next one easier or is a prerequisite for it.

1. Measure your current AI crawler share. Pull 30 days of CDN logs and bucket requests by User-Agent into the four tiers from above. Calculate the percentage of total requests, the percentage of total origin bandwidth, and the per-tier hit ratio against your existing cache. This baseline is the input to every subsequent decision. Most properties discover their crawler share is higher and their crawler cache hit ratio is lower than they expected.

2. Verify identified crawler IP ranges. For each Tier 1 crawler, verify the requests are originating from the published IP ranges. OpenAI publishes GPTBot IP ranges, Anthropic publishes ClaudeBot ranges, Perplexity publishes its ranges, and Google publishes Google-Extended ranges. Requests that use these user agents from non-published IPs should be treated as spoofed and classified as Tier 4. This step alone removes 5 to 15 percent of apparent crawler traffic at most properties.

3. Normalize cache keys to exclude User-Agent. Reconfigure your cache key generation to omit the User-Agent header for content URLs. This is the single most impactful change because it makes crawler requests cache-hit-eligible against the same key that human requests populate. Test thoroughly: this change interacts with any User-Agent-keyed content variation logic you may have for AMP or mobile-specific responses.

4. Apply tiered cache TTLs by content type. Implement the four-tier content lifetime model from the previous section, propagating the tier label as a header or surrogate key from origin through to the CDN. Start conservatively (shorter TTLs) and lengthen as you build confidence in your purge mechanics.

5. Enable stale-while-revalidate across all tiers. Set Cache-Control headers to include the stale-while-revalidate directive on all cacheable responses. This is the single most operationally valuable directive for crawler-facing serving because it converts revalidation from blocking to background.

6. Differentiate cache rules by crawler tier. Apply the Tier 1 long-TTL configuration to identified AI crawlers via your CDN's bot classification engine. Apply rate limiting and challenge logic to Tier 4 traffic. Audit the cross-classification matrix to make sure identified AI crawlers are not accidentally caught by Tier 4 rules.

7. Deploy edge KV crawl tracking. Implement the Workers KV, Edge KV, or EdgeKV counter pattern to track per-URL, per-crawler request frequency. Ship the data to your analytics warehouse. Build a weekly review of which URLs are over-crawled and which are missed.

8. Set up surrogate key purging. Tag cached responses with surrogate keys aligned to your content tiers and content collections. Wire your CMS publish events to purge the appropriate surrogate keys on update. This makes long TTLs safe because you can invalidate specific content slices without flushing the cache.

9. Establish an ongoing crawler observability practice. Schedule monthly reviews of new crawler user agents (the population grows roughly 15 percent quarter over quarter), cache hit ratios per tier, origin bandwidth attribution per crawler, and the alignment between crawler frequency and content business value. The configuration that is correct in May 2026 will need updates by August.
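The IP verification in step 2 is a simple membership test once the ranges are loaded. Here is a minimal Python sketch using the standard `ipaddress` module. The CIDR blocks shown are placeholders taken from the reserved documentation ranges, not real crawler ranges; in practice the table would be refreshed from each operator's published list.

```python
import ipaddress

# Placeholder CIDR blocks from reserved documentation ranges. Real ranges must
# be loaded from each crawler operator's published list and kept up to date.
PUBLISHED_RANGES = {
    "gptbot": ["192.0.2.0/24"],
    "claudebot": ["198.51.100.0/24"],
}


def is_verified(crawler: str, client_ip: str) -> bool:
    """True if client_ip falls inside a published range for this crawler."""
    ip = ipaddress.ip_address(client_ip)
    return any(
        ip in ipaddress.ip_network(cidr)
        for cidr in PUBLISHED_RANGES.get(crawler.lower(), [])
    )


print(is_verified("GPTBot", "192.0.2.17"))   # inside the placeholder range
print(is_verified("GPTBot", "203.0.113.9"))  # outside it, so treat as Tier 4
```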

This playbook takes 3 to 6 weeks of focused engineering for a typical content property, plus an ongoing 1 to 2 days per month of maintenance. The cost recovery is usually realized within the first quarter of operation.
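Step 1 of the playbook is usually the quickest to prototype. The Python sketch below summarizes request share, cache hit ratio, and origin bytes per tier from already-parsed log records. The field names (`tier`, `cache_status`, `bytes`) are hypothetical and would need to be mapped to your CDN's log schema, and the tier label is assumed to have been attached by a classifier beforehand.

```python
from collections import defaultdict
from typing import Dict, Iterable


def crawler_baseline(records: Iterable[dict]) -> Dict[str, dict]:
    """Summarize request share, cache hit ratio, and origin bytes per tier."""
    totals = defaultdict(lambda: {"requests": 0, "hits": 0, "origin_bytes": 0})
    all_requests = 0
    for rec in records:
        row = totals[rec["tier"]]
        row["requests"] += 1
        all_requests += 1
        if rec["cache_status"] == "HIT":
            row["hits"] += 1
        else:
            # Anything that was not an edge hit is counted as origin traffic.
            row["origin_bytes"] += rec["bytes"]
    return {
        tier: {
            "request_share": row["requests"] / all_requests,
            "hit_ratio": row["hits"] / row["requests"],
            "origin_bytes": row["origin_bytes"],
        }
        for tier, row in totals.items()
    }


sample = [
    {"tier": "tier1", "cache_status": "HIT", "bytes": 1000},
    {"tier": "tier1", "cache_status": "MISS", "bytes": 3000},
    {"tier": "human", "cache_status": "HIT", "bytes": 2000},
    {"tier": "human", "cache_status": "HIT", "bytes": 2000},
]
summary = crawler_baseline(sample)
print(summary["tier1"])
```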

The Crawler Permission Economy Implication

The optimization above assumes you want the major AI crawlers to succeed at your property. That assumption is correct for the vast majority of properties in 2026, because the citation upside from being in the training and retrieval corpora is larger than the bandwidth cost of serving the crawlers. But the calculus is evolving as the crawler permission economy matures.

OpenAI, Anthropic, and Perplexity have all formalized programs in late 2025 and early 2026 that pay select publishers for crawl access and citation rights. Common Crawl's bandwidth and usage statistics suggest the total volume of training-related crawling has roughly tripled since 2023. The economic relationship between crawlers and publishers is becoming negotiated rather than implicit, and the CDN configuration you ship today is the foundation for how you participate in those negotiations. A property that cannot measure its crawler traffic granularly cannot demand fair compensation for it. The deeper analysis of how this evolves is in the crawler permission economy and training data monetization, but the infrastructure prerequisite is the same: you need the edge cache and observability stack described above to be a participant rather than a price-taker.

The properties that are doing this well in 2026 treat AI crawler serving as a core platform capability, not a CDN configuration tweak. They have an owner, a budget, a monthly review cadence, and a roadmap. They use Cloudflare AI Audit, Fastly Compute@Edge, or Akamai EdgeWorkers as the implementation surface and they layer their own observability and policy on top. They have moved from defensive (block bots, control access) to offensive (serve bots cheaply, measure influence, negotiate value).

Takeaway: AI crawler traffic is no longer a rounding error on your CDN bill. At 30 to 40 percent of edge requests across the median publisher property in 2026, it is a first-class workload that deserves its own configuration tier and operational practice. The properties that have built that practice — using edge cache differentiation, stale-while-revalidate, edge KV crawl tracking, and bot-class-aware cache rules — are serving GPTBot, ClaudeBot, PerplexityBot, and Google-Extended at near-zero origin cost while preserving the citation upside that makes the crawlers worth welcoming. The properties that have not are either burning six figures of origin bandwidth or blocking the crawlers and losing the AI search referral channel. The 9-step playbook in this piece, executed over a quarter, moves you decisively from the second group into the first.

Frequently Asked Questions

How much of my CDN traffic is now AI bots in 2026?

For most content-heavy properties, AI bot traffic sits between 18 and 42 percent of total edge requests as of Q1 2026, with the median publisher landing near 31 percent according to Cloudflare's AI Bot Traffic report from February 2026. The composition has shifted dramatically since 2024. GPTBot and OAI-SearchBot together account for roughly 11 to 14 percent of total requests across the Cloudflare network, ClaudeBot and anthropic-ai for another 7 to 9 percent, PerplexityBot for 5 to 8 percent, and Google-Extended for 3 to 5 percent. The remaining 10 to 15 percent comes from a long tail of smaller training crawlers, archive crawlers like Common Crawl, and a growing population of unidentified scrapers using residential proxies. If you have not measured this on your own property in the last 60 days, you almost certainly have more bot traffic than you think, and you are paying for it at human-traffic rates.

Should I block GPTBot and ClaudeBot to save bandwidth?

Almost never. Blocking these crawlers removes you from the training corpus and the live retrieval set that feeds AI search results, which is the single largest source of new referral traffic for many publishers in 2026. The right move is to serve them efficiently from your CDN edge rather than block them at origin. With aggressive edge caching, a typical GPTBot crawl costs you fractional cents per million requests because the bot is overwhelmingly hitting cached objects. The economics flip only if you are seeing pathological crawl patterns: a single user agent fetching the same URL hundreds of times per hour, or hitting expensive endpoints like search results or personalized pages. In those cases, the correct response is targeted rate limiting and cache-control rules, not a wholesale block. The Cloudflare AI Audit and Fastly Edge Cloud features now expose this data clearly enough that the decision should be data-driven, not reflexive.

What is stale-while-revalidate and why does it matter for AI crawlers?

Stale-while-revalidate is an HTTP cache directive that tells a CDN to serve a stale cached response immediately while asynchronously fetching a fresh copy from origin. For AI crawlers, this is the single most important cache pattern to get right. AI bots tolerate slightly stale content well because they are not building real-time experiences, and training crawlers consume content asynchronously anyway. By combining max-age with a stale-while-revalidate window sized to the content tier (24 hours for news, days to weeks for evergreen content), you ensure the bot gets an instant edge response even when the cached object has technically expired, while your origin handles one background revalidation rather than every crawler hit. Cloudflare, Fastly, and Akamai all support the directive natively. Web.dev documents it as the recommended pattern for content that is mostly static but occasionally updated. Combined with surrogate keys for purging, it is the foundation of efficient AI crawler serving.
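The directive's behavior reduces to a three-way decision on the age of the cached object. The Python sketch below models that decision only; it is a simplified illustration of the semantics, not any CDN's implementation, and it ignores other directives such as must-revalidate.

```python
from enum import Enum


class Action(Enum):
    SERVE_FRESH = "serve cached copy"
    SERVE_STALE_AND_REVALIDATE = "serve cached copy, refresh in background"
    FETCH_FROM_ORIGIN = "wait for origin"


def swr_decision(age_seconds: int, max_age: int, stale_while_revalidate: int) -> Action:
    """Decide how a cache should answer, given the cached object's age."""
    if age_seconds < max_age:
        return Action.SERVE_FRESH
    if age_seconds < max_age + stale_while_revalidate:
        # Expired but inside the stale window: answer now, refresh asynchronously.
        return Action.SERVE_STALE_AND_REVALIDATE
    return Action.FETCH_FROM_ORIGIN


# A news object with max-age of 4 hours and a 24-hour stale window.
for age in (3_600, 20_000, 110_000):
    print(age, swr_decision(age, 14_400, 86_400).value)
```

Only the third case puts the crawler on the origin's critical path, which is why a generous stale window matters more for crawler traffic than a long max-age alone.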

Can I serve different cache policies to GPTBot than to human users?

Yes, and you should. Most CDNs let you key cache rules off the User-Agent header or a derived bot-class signal, and applying longer time-to-live values to crawler traffic is one of the highest-leverage configurations available. A common conservative starting point is to serve human users a 5-minute browser cache and a 60-minute edge cache, while serving identified AI crawlers an edge cache of 24 to 72 hours with stale-while-revalidate of an additional 7 days, then lengthening the crawler values as you gain confidence in your purge mechanics. The risk to avoid is content cloaking. Google has explicitly clarified that serving a different content body to crawlers than to users violates webmaster guidelines, but serving the same body with different cache headers is fine. Cloudflare AI Audit, Fastly Compute@Edge, and Akamai EdgeWorkers all expose bot-class detection that you can use to apply these rules without writing your own user-agent parser, and the controls are auditable from the dashboard.

How do I track AI crawler frequency at the edge without overloading my origin?

Use edge key-value storage to record crawl counters per URL per crawler, then ship aggregates to your analytics system rather than logging every request. Cloudflare Workers KV, Fastly KV Store, and Akamai EdgeKV all support low-latency writes from the edge with consistency guarantees that are adequate for analytical use. The standard pattern is to count every crawler-classified request keyed on URL and a bot-class label, batch the counters into 60-second windows, and ship the aggregated deltas to your data warehouse. This gives you per-URL, per-crawler frequency data without sending raw logs to origin. You can then identify which URLs are being over-crawled (candidates for longer TTLs or sitemap deprioritization) and which are being missed (candidates for sitemap promotion or origin pre-warming). Doing this at the edge keeps your origin out of the analytics path entirely, which is the whole point.