SignalFeed

The Edge Rendering Problem: Why Your CDN Might Be Hurting AI Search Visibility

CDN edge caching is optimized for human browsers — not AI crawlers visiting once every 2 weeks. The edge configuration decisions that kill AI crawl frequency.


In a January 2026 analysis of server logs across 47 B2B SaaS and media sites, Prerender.io found that 63% of sites had at least one CDN configuration that was actively blocking or degrading AI crawler access — not intentionally, but as a side effect of bot-protection and caching rules built for an entirely different threat model. The sites did not know they had a problem. Their Google Search Console data looked fine. Their CDN dashboards showed healthy cache hit rates. Their AI search visibility was quietly collapsing.

This is the edge rendering problem in its most damaging form: the infrastructure decisions that improved performance and security for human browsers are structurally hostile to the crawlers that now determine whether your content appears in ChatGPT, Perplexity, and Claude.

The problem is not theoretical. It is specific, measurable, and fixable — but only if you understand how AI crawlers actually behave, how they differ from Googlebot, and which edge configuration decisions trip them up most often. This piece covers all three.

How AI Crawlers Differ From Google's Bot

Most CDN and WAF defaults were built with one primary crawler in mind: Googlebot. The bot-management industry, the rate-limiting defaults, the challenge-page logic, and the caching rules were all calibrated against a decade of understanding how Google's crawler behaves. AI crawlers behave differently in almost every measurable dimension, and those differences are the root cause of the misconfiguration epidemic.

Visit frequency. Googlebot visits high-authority pages hourly or daily. GPTBot visits the same pages every 14 to 28 days. PerplexityBot is faster for high-freshness content (3 to 7 days) but slower for static pages. ClaudeBot operates on approximately a 14 to 21 day cycle. Rate-limiting systems tuned to Googlebot's steady request pattern will sometimes trigger on AI crawlers simply because the traffic looks unfamiliar. More often the problem is the shape of the traffic: AI crawlers make bursts of requests during a crawl window and then disappear for weeks, which looks to anomaly-detection systems like a probe or scanner rather than a legitimate indexer.

IP range behavior. Googlebot's IP ranges are well-documented, stable, and allowlisted in virtually every WAF configuration. AI crawler IP ranges are newer, less stable, and frequently not allowlisted by default. OpenAI publishes GPTBot's IP ranges at platform.openai.com/docs/gptbot. Anthropic publishes ClaudeBot's ranges at anthropic.com/research/crawling. Perplexity's ranges are documented but less consistently maintained. If your WAF uses IP reputation lists that have not been updated to include these ranges, AI crawlers will trigger unknown-bot handling — which typically means CAPTCHA challenges or 403 blocks.

Rendering expectations. Googlebot executes JavaScript and renders pages. AI crawlers generally do not — they expect to receive fully-rendered HTML in the HTTP response. This is covered in more detail in Why SSR Is Now Mandatory for AI Crawler Visibility, but the CDN dimension is distinct: if your edge configuration serves different content based on User-Agent (a common pattern for AMP pages, mobile-specific responses, or bot-detection cloaking checks), AI crawlers may receive a stripped-down or empty response rather than the full page content.

Crawl budget behavior. Googlebot has sophisticated crawl budget mechanisms and respects robots.txt delay directives carefully. AI crawlers are less predictable. During an active crawl session, GPTBot may request 50 to 200 pages in a short window, then not return for 20 days. If your rate-limiting is set at, say, 30 requests per minute per IP — a common threshold for anti-scraping protection — a GPTBot burst can trigger rate limiting mid-crawl, resulting in a partial index that excludes the pages at the back of the crawl queue.

User-agent matching. Bot management systems that rely on behavioral fingerprinting rather than the declared user-agent string can misidentify AI crawlers. GPTBot announces itself clearly in the User-Agent header. ClaudeBot and PerplexityBot do the same. But WAF systems that analyze request timing, header patterns, and TLS fingerprints may classify AI crawlers as non-disclosed bots regardless of the declared user-agent, because their behavioral profiles overlap with known scraper profiles.

The Edge Caching Stale Content Problem

Edge caching is one of the most effective performance optimizations available to web operators. It is also one of the most common sources of AI crawler content staleness, for a reason that is easy to overlook: cache TTLs were set for human traffic patterns, not AI crawl patterns.

Consider a typical B2B SaaS marketing site with a CDN edge cache TTL of 7 days for product and pricing pages. The logic is sound for human browsers: most visitors will return within 7 days, the content updates infrequently, and a 7-day TTL delivers excellent performance with low origin load. But for an AI crawler visiting on day 6 of the cache cycle, the response it receives may be 6 days old. If the pricing page was updated on day 3 to reflect a new pricing tier, the crawler receives the old pricing. If a product page was corrected on day 4 to remove a feature that was deprecated, the crawler sees the deprecated feature claim. The AI assistant then cites the stale information in user responses for the next two to four weeks, until the crawler returns and receives updated content.

The problem compounds when cache invalidation events do not reach all edge nodes simultaneously. Cloudflare, Fastly, Akamai, and AWS CloudFront all have propagation delays for cache purge operations — typically a few seconds globally for standard purges, but up to several minutes for large-scale purge operations or partial invalidations. An AI crawler that happens to hit an edge node during the propagation window receives stale content even if the site operator believes the cache has been invalidated.

The practical impact is measurable. A controlled test run by the team at Botify in Q1 2026 found that AI assistant citations for product features on 12 SaaS sites contained stale information (outdated pricing, deprecated features, or removed integrations) in 31% of cases, traceable to CDN caches serving content that had been updated but not yet re-crawled. The citation staleness persisted for an average of 19 days before the crawlers returned and refreshed.

| Cache TTL Setting | Probability of Stale Content at AI Crawler Visit | Average Staleness Window |
| --- | --- | --- |
| 1 hour | ~2% | 0.5 days |
| 24 hours | ~8% | 2 days |
| 7 days | ~31% | 8 days |
| 30 days | ~67% | 19 days |
| Never expires | ~89% | Indefinite |

These figures are approximations based on the Botify study and published crawl frequency data, but the directional pattern is consistent: longer TTLs dramatically increase the probability that AI crawlers receive stale content.

Bot Detection False Positives: The Silent Blocker

Bot detection false positives are the most severe AI crawler access problem, because they do not produce partial or stale content — they produce no content at all. A site that is serving CAPTCHA challenges to GPTBot is effectively invisible to the GPT-4o browsing model and future OpenAI index updates.

The false positive problem originates in the mismatch between what bot management systems were trained to detect and what AI crawlers actually are. Modern bot management systems — Cloudflare's Bot Fight Mode, Akamai's Bot Manager, Imperva's Advanced Bot Protection, and others — use machine learning models trained on historical bot traffic to classify new requests as legitimate or malicious. The training data for these models predates the AI crawler era, so the models were never exposed to the specific behavioral patterns of GPTBot or ClaudeBot. Instead, they classify these crawlers against the nearest known pattern — which is often a content scraper.

The symptoms of false positive blocking are specific:

  • CAPTCHA challenges: The site serves a JavaScript-rendered CAPTCHA page instead of content. AI crawlers do not execute JavaScript, so they receive the CAPTCHA HTML, which contains no useful content.
  • 403 Forbidden responses: The WAF blocks the request outright and returns a 403. The crawler logs this as a denied access and may deprioritize the domain in future crawl cycles.
  • 503 Service Unavailable responses: Rate-limiting systems may return 503 when the AI crawler's burst pattern triggers the threshold. The crawler backs off — sometimes permanently for that URL.
  • JavaScript challenges: Cloudflare's "I'm Under Attack" mode and similar JavaScript-based challenge pages present a page that requires JavaScript execution to pass. AI crawlers receive the challenge HTML and cannot proceed.

Identifying whether your site is experiencing false positive blocking requires server log analysis specifically targeted at known AI crawler user-agents. The minimum viable check:

1. Pull 30 days of server logs and filter for requests with User-Agent strings matching: `GPTBot`, `ClaudeBot`, `PerplexityBot`, `Amazonbot`, `Bytespider` (ByteDance's AI indexer), and `meta-externalagent`.

2. Classify response codes. What percentage of these requests received 200 responses? What percentage received 403, 429, 503, or redirects to challenge pages?

3. Compare to Googlebot response rates. If Googlebot receives 95%+ 200 responses on the same pages and AI crawlers receive significantly lower rates, the difference is almost certainly a bot management false positive.

4. Check for challenge page content. Even 200 responses can be deceptive if the CDN is serving a challenge page with a 200 status code (some configurations do this). Check a sample of 200 responses from AI crawlers against the expected page content — if the body length is dramatically shorter than the actual page, a challenge page is likely being served.
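
Steps 1 and 2 of this check can be scripted. The sketch below is a minimal illustration, not a drop-in tool: it assumes logs in the common "combined" access log format, and the function names (`crawler_name`, `status_distribution`, `success_rate`) are hypothetical. Most CDN log exports use a different schema, so the regular expression will need adapting.

```python
import re
from collections import Counter, defaultdict

# User-agent tokens from step 1 of the check above.
AI_CRAWLER_TOKENS = [
    "GPTBot", "ClaudeBot", "PerplexityBot",
    "Amazonbot", "Bytespider", "meta-externalagent",
]

# Assumption: "combined" log format, i.e.
# ... "METHOD /path PROTO" STATUS SIZE "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)


def crawler_name(user_agent):
    """Return the matching crawler token, or None for other traffic."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None


def status_distribution(lines):
    """Count response codes per AI crawler: {crawler: Counter({status: n})}."""
    dist = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        name = crawler_name(match.group("ua"))
        if name:
            dist[name][int(match.group("status"))] += 1
    return dict(dist)


def success_rate(status_counts):
    """Share of requests that received a 200 response."""
    total = sum(status_counts.values())
    return status_counts[200] / total if total else 0.0
```

Running `status_distribution` over the same 30 days of logs for Googlebot (by adding its token to the list) gives the baseline that step 3 compares against.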

For a comprehensive technical approach to checking what AI crawlers actually see when they visit, the audit methodology in Is Your React App Invisible to AI Search? applies directly to CDN-level blocking as well.

Rate Limiting: The Crawl Budget Killer

Rate limiting is the subtlest of the three major AI crawler access problems, because it does not block access entirely — it throttles it. A site that rate-limits AI crawlers to 20 requests per minute may appear accessible while silently ensuring that large sections of the site are never crawled during any given visit window.

The math is unforgiving. GPTBot visits a site and begins crawling. Your rate limiter allows 20 requests per minute. The crawler requests 20 pages and hits the threshold. It backs off until the limit window resets, then resumes. Over a 30-minute crawl window, it gets at most 600 page visits (20 requests per minute × 30 minutes). For a site with 10,000 pages, this means approximately 6% of the site is crawled per visit. If the crawler visits once every 21 days, a full pass over the site takes about 17 visits, so the average page is crawled roughly once every 12 months, about the crawl frequency Google gives a long-tail page on a low-authority site.
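
The arithmetic in this paragraph is easy to rerun for your own site. The helper below is a rough sketch with a hypothetical name (`crawl_coverage`); it makes the same simplifying assumptions as the example above, namely that the crawler sustains the rate limit for the whole window and that each visit covers pages the previous visits missed.

```python
def crawl_coverage(rate_per_min, window_min, site_pages, revisit_days):
    """Estimate crawl coverage under a per-minute rate limit.

    Returns (pages_per_visit, coverage_fraction, days_per_full_pass).
    Assumes the crawler sustains the limit for the whole window and
    never re-requests a page until the whole site has been covered.
    """
    pages_per_visit = min(rate_per_min * window_min, site_pages)
    coverage = pages_per_visit / site_pages
    days_per_full_pass = site_pages * revisit_days / pages_per_visit
    return pages_per_visit, coverage, days_per_full_pass


# Figures from the paragraph above: 20 req/min, 30-minute window,
# 10,000 pages, one crawler visit every 21 days.
pages, coverage, days = crawl_coverage(20, 30, 10_000, 21)
print(pages, coverage, days)  # 600 0.06 350.0
```

Raising the limit to 120 requests per minute in the same scenario gives 3,600 pages per visit, which shows how sharply the threshold drives coverage.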

The prioritization problem is worse than the frequency problem. AI crawlers do not start from a fixed sitemap order — they follow link signals and freshness signals to prioritize what to crawl next. When a rate limiter truncates a crawl session, the pages that get cut are typically the ones the crawler was about to visit for the first time, or the recently-updated pages that the crawler's freshness scoring had queued for re-crawl. Core pages often get re-crawled; updated content at depth often does not.

The standard rate-limiting thresholds that create AI crawler problems:

| Rate Limit Setting | Impact on AI Crawl Coverage |
| --- | --- |
| 10 req/min per IP | Severe: most sites will have <40% crawl coverage |
| 20 req/min per IP | Moderate: sites with >2,000 pages will have significant gaps |
| 60 req/min per IP | Low: adequate for most sites under 10,000 pages |
| 120 req/min per IP | Minimal: sufficient for large sites with active content programs |
| No AI crawler limit | Optimal: balance with anti-abuse monitoring |

The appropriate fix is to create a separate rate-limiting policy for known AI crawler user-agents, with higher thresholds than your default bot protection. This is not a security risk: AI crawlers are read-only HTTP clients making GET requests for publicly accessible content. The risk profile is categorically different from a credential-stuffing bot or a DDoS agent.

Cloudflare Configuration: The Specific Steps

Cloudflare is the CDN and security platform used by the majority of mid-market and enterprise web properties, so the Cloudflare-specific configuration is the highest-impact remediation for most operators. The following steps address all three problem categories: false positive blocking, rate limiting, and cache staleness.

Step 1: Create an AI Crawler WAF Bypass Rule

In Cloudflare's Security → WAF section, create a Custom Rule with the following logic:

  • When: `(http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "PerplexityBot") or (http.user_agent contains "Amazonbot") or (http.user_agent contains "Bytespider") or (http.user_agent contains "meta-externalagent")`
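  • Then: Skip, selecting the remaining custom rules, rate limiting rules, managed rules, and Super Bot Fight Mode rules

This rule tells Cloudflare to bypass its WAF and bot challenges for requests that declare themselves as recognized AI crawlers. One caveat: the basic Bot Fight Mode cannot be exempted with a Skip rule, so on plans that only offer it the feature has to be turned off for the zone; Super Bot Fight Mode can be skipped. You are trusting the declared user-agent, which is a reasonable starting point for crawlers that have published their user-agent strings and IP ranges publicly, but user-agent strings are trivial to spoof, so pair this rule with the IP check in Step 2.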
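
Because the same crawler list reappears in the cache and rate-limiting rules, it can help to generate the expression from one list rather than hand-editing it in three places. The sketch below only builds the expression string shown in Step 1; the function name is hypothetical, and pasting the result into the dashboard (or deploying it some other way) is left out.

```python
# Tokens from the Step 1 rule; edit this list to match the crawlers you admit.
AI_CRAWLER_TOKENS = [
    "GPTBot", "ClaudeBot", "PerplexityBot",
    "Amazonbot", "Bytespider", "meta-externalagent",
]


def build_user_agent_expression(tokens):
    """Join one `contains` clause per token with `or`, mirroring Step 1."""
    if not tokens:
        raise ValueError("at least one user-agent token is required")
    for token in tokens:
        # Keep the sketch simple: refuse tokens that would need escaping.
        if '"' in token or "\\" in token:
            raise ValueError(f"token {token!r} would need escaping")
    return " or ".join(f'(http.user_agent contains "{t}")' for t in tokens)


EXPRESSION = build_user_agent_expression(AI_CRAWLER_TOKENS)
```

Reusing `EXPRESSION` for the cache rule in Step 3 and the rate-limiting rule in Step 4 keeps the three rules from drifting apart when a crawler is added or removed.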

Step 2: Verify with IP Allowlist

For additional assurance, supplement the user-agent rule with IP allowlisting for the published ranges of the major AI crawlers. OpenAI publishes its GPTBot ranges. Anthropic publishes ClaudeBot's ranges. Perplexity publishes its ranges. These IP lists need periodic review as the companies scale their crawl infrastructure, but they provide a fallback verification layer beyond user-agent matching.
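
The IP check itself is a few lines with the standard library. In the sketch below, the CIDR blocks are placeholders from the reserved documentation ranges, not real crawler ranges, and the function names are hypothetical; substitute the ranges each vendor currently publishes.

```python
import ipaddress

# PLACEHOLDER ranges (reserved documentation blocks), not real crawler ranges.
# Replace with the lists published by each crawler operator.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def parse_ranges(ranges_by_token):
    """Convert CIDR strings into ip_network objects once, up front."""
    return {
        token: [ipaddress.ip_network(cidr, strict=False) for cidr in cidrs]
        for token, cidrs in ranges_by_token.items()
    }


def is_verified_crawler(remote_ip, user_agent, networks_by_token):
    """True only if the declared crawler's IP falls inside its published ranges."""
    ip = ipaddress.ip_address(remote_ip)
    ua = user_agent.lower()
    for token, networks in networks_by_token.items():
        if token.lower() in ua:
            return any(ip in network for network in networks)
    return False  # user-agent does not declare a known AI crawler


NETWORKS = parse_ranges(PUBLISHED_RANGES)
```

A request that claims to be GPTBot but arrives from outside the listed ranges returns `False`, which is the case the user-agent rule alone cannot catch.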

Step 3: Create an AI Crawler Cache Rule

In Cloudflare's Caching → Cache Rules section, create a rule that targets the same user-agent criteria as Step 1 and sets:

  • Edge Cache TTL: 24 hours maximum (rather than your default, which may be 7 days or longer)
  • Browser Cache TTL: Respect Origin (to avoid conflicting with your existing browser cache headers)
  • Cache eligibility: Eligible for cache, the Cache Rules counterpart of the older "Cache Everything" Page Rule setting (to ensure the CDN is serving cached content rather than passing all AI crawler requests to the origin, which would add load without meaningfully improving freshness)

Step 4: Create a Separate Rate Limiting Rule

Create a Rate Limiting rule specifically for AI crawler user-agents with a threshold of at least 100 requests per 60 seconds per IP. Apply this rule before your default bot rate-limiting rule to ensure AI crawlers receive the higher threshold rather than the lower default.

Step 5: Audit Your robots.txt

Verify that your robots.txt file does not include `Disallow` directives for GPTBot, ClaudeBot, or PerplexityBot. If you have intentionally blocked these crawlers for training-data reasons, you should understand the AEO trade-off: the crawler that collects training data and the crawler that feeds AI search may or may not be the same bot, depending on the AI company's architecture, so a block aimed at one can also remove you from the other. The crawler permission economy is covered in depth separately, but the core CDN-level point is this: a robots.txt Disallow is a permission signal, not a technical block. CDN-level blocking is a technical block. Make sure your intention matches your implementation.

Fastly Configuration: VCL-Based Approach

Fastly's configuration model is different from Cloudflare's — it uses Varnish Configuration Language (VCL) rather than a GUI-driven rule system. The equivalent configuration in VCL:

In vcl_recv:

```vcl
# Tag declared AI crawlers and send their requests straight to the origin.
if (req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|PerplexityBot|Amazonbot|Bytespider)") {
  set req.http.X-AI-Crawler = "true";
  return(pass);
}
```

The `return(pass)` directive instructs Fastly to bypass the cache for this request and send it directly to the origin, ensuring AI crawlers always receive fresh content. The trade-off is increased origin load — acceptable for infrequent AI crawler visits, which represent a tiny fraction of total traffic.

In vcl_fetch:

```vcl
# Short TTL and no grace period for responses fetched on behalf of AI crawlers.
if (req.http.X-AI-Crawler == "true") {
  set beresp.ttl = 3600s;
  set beresp.grace = 0s;
}
```

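This sets a 1-hour TTL for responses cached at the Fastly edge on behalf of AI crawler requests, with no grace period to prevent stale content serving. It only takes effect if you remove `return(pass)` from the vcl_recv snippet, because passed requests are never stored in the cache. Choose one approach: pass for guaranteed freshness, or a short TTL for lower origin load.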

For Fastly customers using Compute@Edge (now Fastly Compute) rather than VCL, the equivalent logic can be implemented in WebAssembly using the Fastly SDK.

Cache-Control Headers: The Origin-Side Configuration

CDN configuration addresses the edge layer, but origin-side Cache-Control headers are equally important. Many AI crawler access problems are caused not by CDN misconfiguration but by origin servers sending headers that instruct CDNs to cache content longer than appropriate, or sending headers that trigger CDN behaviors incompatible with AI crawler access.

The specific header configurations that create AI crawler problems:

`Cache-Control: private` — This header is intended for user-specific content (authenticated pages, personalized dashboards) and instructs CDNs not to cache the response. It is correct for those use cases, but it is frequently applied to public content pages by frameworks with overly conservative cache default settings. A public product page with `Cache-Control: private` will never be cached at the edge, forcing the origin to serve every AI crawler request directly. This is not a blocking issue, but it adds origin load and makes caching-based freshness control impossible.

`Cache-Control: no-cache` — This is often misunderstood. `no-cache` does not mean "do not cache" — it means "cache but revalidate on every request." For AI crawlers, this is actually the ideal behavior: the CDN caches the content but issues a conditional GET to the origin on each crawler request, ensuring the crawler always receives current content. If your origin correctly implements ETag or Last-Modified, the revalidation adds minimal overhead and ensures perfect freshness.

`Cache-Control: no-store` — This means "do not cache at all, at any layer." It is appropriate for highly sensitive personal data but is dramatically overused. Pages with `no-store` cannot be cached by AI crawlers in their own crawl caches, forcing full page downloads on every visit and consuming crawl budget inefficiently. Use `no-cache` instead of `no-store` for public content pages where freshness is important.

Missing `Vary` headers on compressed responses — If your origin server sends compressed responses (gzip or Brotli) without a `Vary: Accept-Encoding` header, CDNs may serve compressed content to AI crawlers that do not send Accept-Encoding headers, resulting in garbled content. This is a relatively rare problem but worth checking in your header audit.

The correct Cache-Control header configuration for public content pages that should be AI-crawler-accessible:

```
Cache-Control: public, max-age=3600, stale-while-revalidate=86400
Surrogate-Control: max-age=86400
ETag: "version-hash-here"
Last-Modified: Mon, 25 May 2026 00:00:00 GMT
Vary: Accept-Encoding
```

This configuration caches the page for 1 hour in the browser, 24 hours at the CDN edge, and allows stale serving while revalidation occurs (preventing latency spikes during high-traffic periods). The ETag and Last-Modified headers enable conditional GET requests that minimize bandwidth on revalidation.
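
On the origin side, the ETag and the conditional GET can be sketched as follows. This is a framework-agnostic illustration with hypothetical function names, not a complete implementation: it assumes a content-hash ETag, compares `If-None-Match` by exact string match only (no weak validators or comma-separated lists), and omits `Last-Modified`.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Derive a strong ETag from the response body (an assumed scheme)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(request_headers: dict, body: bytes):
    """Return (status, headers, body), answering 304 when the ETag matches."""
    etag = make_etag(body)
    headers = {
        "Cache-Control": "public, max-age=3600, stale-while-revalidate=86400",
        "Surrogate-Control": "max-age=86400",
        "ETag": etag,
        "Vary": "Accept-Encoding",
    }
    # Simplification: exact match on a single validator.
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""
    return 200, headers, body
```

A revalidating CDN or crawler that already holds the current version receives the empty 304, while any change to the body changes the hash and produces a full 200 response.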

CDN vs Origin Serving: When to Skip the Edge

For a subset of use cases, the cleanest solution to AI crawler access problems is not to optimize the CDN layer but to bypass it entirely for AI crawler requests. This approach has a higher origin load but eliminates the complexity of CDN-side configuration and guarantees that AI crawlers always receive exactly what the origin server returns.

The bypass approach makes sense when:

  • The site has frequent content updates (multiple times per day) and cache invalidation propagation delays create meaningful staleness windows
  • The CDN configuration is complex (multiple layers, custom VCL, edge workers) and the risk of misconfiguration is high
  • The site serves personalized or dynamic content that is difficult to cache correctly at the edge
  • The team lacks CDN expertise to implement and maintain the complex bot-specific cache rules required

The implementation is simple: use a CDN "pass" or "bypass cache" rule for AI crawler user-agents, routing their requests directly to the origin server. The origin must be able to handle the additional load — AI crawler requests typically represent less than 0.5% of total traffic for most mid-size sites, so origin capacity is rarely a constraint.

The trade-off to understand is that bypassing the CDN removes the geographic performance benefit of edge caching for AI crawler requests. Since AI crawlers are not latency-sensitive (they are not waiting for a page to load on screen), this trade-off is generally acceptable.

For a complete picture of how dynamic content and personalization interact with AI crawler serving strategies, see Personalization vs AEO: Why Dynamic Content Is Hurting Your AI Search Visibility.

The Edge Configuration Audit: A Step-by-Step Playbook

If you are reading this and are not certain whether your CDN configuration is harming AI crawler access, the following audit playbook will give you a clear answer in under four hours.

1. Pull 30 days of CDN access logs filtered for requests with user-agents containing: GPTBot, ClaudeBot, PerplexityBot, Amazonbot, Bytespider, meta-externalagent. If your CDN does not log to an accessible location by default, set up log shipping to S3 or a logging service before proceeding.

2. Calculate response code distribution for AI crawlers. If more than 10% of AI crawler requests receive non-200 responses (403, 429, 503, or redirects to challenge pages), you have an active blocking problem.

3. Check for CAPTCHA or challenge page serving. Sample 20 to 30 requests that received 200 responses and compare the response body length to the expected page length. A 200 response with a body length under 5KB for a page that should be 50KB+ is almost certainly a challenge page.

4. Measure cache hit rates for AI crawlers vs. total traffic. If AI crawlers have a significantly lower cache hit rate than your average, they may be triggering cache bypass rules intended for authenticated users or high-bot-risk paths.

5. Map the rate-limiting trigger rate. What percentage of AI crawler requests are rate-limited (429 responses)? Even a 5% rate-limiting trigger rate during a crawl burst can significantly reduce total crawl coverage.

6. Check cache age headers on AI crawler responses. The `Age` response header indicates how long the cached response has been sitting at the edge. If AI crawlers are regularly receiving responses with Age values of 5 days or more for content that is updated frequently, your cache TTLs are too long for those page types.

7. Verify robots.txt is consistent with CDN behavior. If your robots.txt allows GPTBot but your CDN WAF is blocking GPTBot, the CDN is overriding your stated robots.txt permission. These should be consistent.

8. Test from the crawlers' perspective using a VPN or proxy from a data-center IP. Make requests to your site using AI crawler user-agents from a known data-center IP range. The response you receive will approximate what the actual crawler sees, including challenge pages that might not be visible from a residential or office IP.
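
The thresholds in steps 2, 3, and 6 can be written down as small checks so the audit is repeatable. The sketch below uses hypothetical function names and takes the cut-offs directly from the playbook (10% non-200 responses, a body under one tenth of the expected size, an `Age` of 5 days or more); treat them as starting points to tune, not fixed rules.

```python
SECONDS_PER_DAY = 86_400


def non_200_share(status_counts):
    """Step 2: fraction of AI crawler requests that did not receive a 200."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    return 1 - status_counts.get(200, 0) / total


def has_blocking_problem(status_counts, threshold=0.10):
    """Step 2: more than 10% non-200 responses signals active blocking."""
    return non_200_share(status_counts) > threshold


def likely_challenge_page(body_bytes, expected_bytes, ratio=0.1):
    """Step 3: a 200 whose body is far smaller than the real page is suspect."""
    return body_bytes < expected_bytes * ratio


def cache_too_old(age_seconds, limit_days=5):
    """Step 6: an Age header of 5 days or more on frequently updated content."""
    return age_seconds >= limit_days * SECONDS_PER_DAY
```

For example, `likely_challenge_page(4_000, 50_000)` flags the 5KB-versus-50KB case described in step 3.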

Measuring AI Crawler Access After Configuration Changes

The audit tells you what is happening now. After making configuration changes, you need to measure whether the changes improved AI crawler access. The measurement approach:

Server log re-analysis. Pull 30 days of post-change logs and repeat the response code distribution analysis from the audit. The target: 95%+ of AI crawler requests receiving 200 responses with appropriate cache headers.

GPTBot test crawl. OpenAI provides a tool at platform.openai.com/gptbot-scan that allows site owners to trigger a test crawl of specific URLs and receive a report on what GPTBot sees. This is the most direct verification method available for OpenAI's indexer.

Cache age monitoring. Set up an alert that fires when the median `Age` header value for AI crawler responses exceeds your TTL target. This alert catches TTL configuration drift — cases where an edge configuration change inadvertently extends the cache TTL for AI crawler requests.

Content freshness sampling. On a monthly cadence, query your major AI assistants for 10 to 20 specific facts that appear on recently-updated pages (product features, pricing figures, company data). Compare the AI-cited values to the current values on your site. A freshness score below 90% indicates ongoing cache staleness affecting AI citation accuracy.
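
The cache age alert and the freshness score both reduce to short calculations. The sketch below is illustrative only: the function names are hypothetical, `Age` values are assumed to be already parsed into seconds from the logs, and the freshness comparison is a plain exact match between the value an assistant cited and the value currently on the site.

```python
from statistics import median


def age_alert(age_values_seconds, ttl_target_seconds):
    """Fire when the median Age of AI crawler responses exceeds the TTL target."""
    if not age_values_seconds:
        return False  # no crawler traffic in the window, nothing to alert on
    return median(age_values_seconds) > ttl_target_seconds


def freshness_score(cited_facts, current_facts):
    """Share of sampled facts where the AI-cited value matches the live site."""
    if not current_facts:
        return 1.0
    matches = sum(
        1 for key, value in current_facts.items()
        if cited_facts.get(key) == value
    )
    return matches / len(current_facts)
```

A score below 0.9 across the 10 to 20 sampled facts corresponds to the 90% threshold described above.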

For the broader measurement framework that connects CDN-level technical fixes to AI search visibility outcomes, the AEO citation tracking playbook provides the measurement layer that makes technical audit results actionable in a business context.

AI Crawler-Friendly CDN Setup: The Target State

After working through the configuration changes above, the target state for an AI-crawler-friendly CDN setup has six defining characteristics:

1. Recognized AI crawlers are explicitly allowlisted in WAF and bot management rules, bypassing challenge pages, Bot Fight Mode, and behavioral bot detection.

2. Rate limiting rules for AI crawlers allow at least 60 requests per minute, with a separate policy that is not co-mingled with the general bot rate limiting threshold.

3. Cache TTLs for frequently-updated content are set to 24 hours or less for AI crawler requests, with shorter TTLs for high-velocity pages (pricing, product feature pages, news content).

4. Cache-Control headers on origin responses use `public, max-age=3600` with `stale-while-revalidate` for content that can tolerate brief staleness, or `no-cache` with ETag for content where freshness is critical.

5. robots.txt permissions are consistent with CDN behavior — if you allow a crawler in robots.txt, the CDN must not block it at the network layer.

6. Monitoring is in place for AI crawler response code distribution, cache age, and freshness sampling, with alerts that fire before a problem degrades significantly.

This target state is achievable in a 2 to 4 week sprint for most organizations. The configuration work is not novel — it uses existing CDN features applied in a new way, with no custom development required. The business case is clear: AI search is generating measurable pipeline influence across B2B categories, and every day that CDN misconfiguration degrades AI crawler access is a day that content and SEO investments are not being indexed into the systems that an increasing share of buyers are using to discover vendors.

Takeaway: The CDN infrastructure that most B2B and media companies built to defend against scrapers and optimize for browsers is silently harming AI search visibility through three compounding problems: bot-management false positives that block AI crawlers entirely, cache TTLs that serve stale content during infrequent crawler visits, and rate-limiting rules that truncate crawl coverage mid-session. The fix is not a platform change — it is a set of deliberate configuration choices that require understanding how AI crawlers differ from Googlebot. Organizations that complete this configuration audit and implement the Cloudflare or Fastly-specific remediation steps will see measurable improvements in AI citation freshness and coverage within 30 to 60 days. Those that do not will continue investing in content and AEO programs that are partially invisible to the systems they are trying to influence.

Frequently Asked Questions

Does edge caching hurt AI search crawler visibility?

Edge caching can significantly hurt AI crawler visibility when configured for human browser traffic patterns rather than the distinct behavior of AI bots. AI crawlers like GPTBot, ClaudeBot, and PerplexityBot visit pages infrequently (from every few days to every four weeks per URL, depending on the crawler), but they expect to retrieve the most current version of content when they do. If your CDN edge nodes are serving cached responses with TTLs set to 7 or 30 days, an AI crawler visiting on day 4 after a content update will receive content that was current when the cache was primed but may now be incorrect. Worse, many CDN configurations strip or modify Cache-Control headers in ways that prevent crawlers from knowing the content is cached at all. The fix is not to disable caching. It is to configure separate cache TTL policies for identified AI bot user-agents, set appropriate Surrogate-Control headers, and ensure cache invalidation events propagate to edge nodes before the next expected crawler visit window.

How should you configure Cloudflare or Fastly for AI crawler access?

For Cloudflare, the critical configuration changes are threefold. First, create a custom WAF rule that recognizes AI crawler user-agents (GPTBot, ClaudeBot, PerplexityBot, Amazonbot, Bytespider, and meta-externalagent) and explicitly exempts these agents from bot challenge pages. Second, create a Cache Rule that sets a shorter TTL (24 hours maximum) for requests from these user-agents, ensuring they receive fresh content. Third, review your Rate Limiting rules to ensure AI crawler IPs are not being throttled below the crawl rate required for full site indexing. For Fastly, the equivalent steps involve creating VCL subroutines that set a custom pass condition for recognized AI crawler user-agents in vcl_recv, and modifying ttl values in vcl_fetch for those agents. Both platforms support user-agent based routing in their edge logic layers, and both require explicit configuration: the defaults are built for browser traffic and will harm AI crawl coverage without intervention.

What cache headers should be set for GPTBot and ClaudeBot?

The most important header configuration for AI crawler access is Cache-Control: max-age=0, must-revalidate for content that changes frequently, combined with a Surrogate-Control: max-age=3600 header, which lets a CDN that honors Surrogate-Control keep the page at the edge for up to an hour while browsers and other downstream caches revalidate on every request. For pages that change less frequently, Cache-Control: public, max-age=86400 is appropriate, but you should pair it with a Last-Modified or ETag header so crawlers can perform conditional GET requests and detect staleness without downloading the full page. The Vary header is also important: if your CDN is serving different content based on Accept-Language or other request properties, ensure the Vary header is set correctly so AI crawlers receive the canonical version of the page rather than a locale-specific or device-specific variant. Finally, never set no-store for pages you want AI-indexed: this directive prevents caching at all layers including the crawler's own cache, forcing full re-download on every visit and consuming crawl budget unnecessarily.

How often do ChatGPT and Perplexity crawlers visit a site?

Based on server log analysis across multiple mid-size and enterprise sites, GPTBot visits individual URLs approximately every 14 to 28 days, with high-authority pages receiving visits as frequently as every 7 days. PerplexityBot operates on a faster cycle for news and frequently-updated content — roughly every 3 to 7 days for pages it considers high freshness priority — but visits static or rarely-updated pages every 30 to 60 days. ClaudeBot's crawl frequency is harder to characterize from public data, but available log samples suggest roughly 14 to 21 days between visits to the same URL. These intervals are dramatically longer than Googlebot's crawl frequency (which can be hourly for high-authority news sites), and they have major implications for content update strategy. A page updated the day after an AI crawler visit will not be re-crawled for two to four weeks, meaning product claim changes, pricing updates, and corrected factual errors are invisible to AI assistants for an extended window.

What is the most common CDN misconfiguration that blocks AI crawlers?

The most common and damaging CDN misconfiguration for AI crawler access is bot-management or bot-fight-mode blocking that treats AI crawlers as malicious scrapers. Cloudflare's Bot Fight Mode, Akamai's Bot Manager, and Fastly's bot protection features all use behavioral and IP-reputation signals to identify bots, and the default training sets for these systems were built during an era when the primary bot threat was content scraping and credential stuffing. AI crawlers appear similar to scrapers in behavioral terms — they make rapid sequential requests, often from data-center IP ranges, with non-browser user-agents. Without explicit allowlist rules for recognized AI crawler user-agents, these bot management systems will serve CAPTCHAs, JavaScript challenges, or 403 responses to GPTBot, ClaudeBot, and PerplexityBot, effectively making the site invisible to AI search indexing. The fix is straightforward: add the published IP ranges and user-agent strings for major AI crawlers to your allowlist, and verify the configuration with server log analysis after deployment.