Buyer's Guide Format AEO: Winning High-Intent Citations When Shoppers Ask AI

rel=canonical was built for Google's URL deduplication. GPTBot, ClaudeBot, Perplexity, and Common Crawl each treat duplicate signals differently — and the gap is rewriting syndication strategy.

By Sofia Reyes, Content Strategy · May 25, 2026 · 17 min read

In a Search Off the Record episode published in February 2026, Google's Gary Illyes confirmed what AEO operators had been measuring for eighteen months: Google's own AI Overviews layer applies a different duplicate-content reconciliation than Google Search, and rel=canonical does not guarantee that the canonical URL is the one cited in an AI answer. The Search Central team estimates that roughly 12 to 18 percent of citations in AI Overviews resolve to a URL different from what Google Search would surface for the equivalent query, with most of those divergences caused by AI Overviews preferring the URL with more cross-domain inbound references regardless of canonical signal. That gap is the entire reason canonical strategy in 2026 needs a complete rewrite for the AI search era.

The rel=canonical link element was introduced by Google, Microsoft, and Yahoo in February 2009 as a solution to a specific problem: URL parameters, session IDs, and tracking codes were creating thousands of duplicate URLs that diluted ranking signals and confused the index. The fix was elegant. Publishers added a single link tag declaring the preferred URL, and search engines consolidated signals to that URL. For seventeen years, canonical tags have been the workhorse of duplicate content management for Google, Bing, Yandex, and Baidu. They still are. But the AI search layer (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and Common Crawl's downstream consumers) was not part of the 2009 design, and the assumptions baked into rel=canonical do not hold uniformly across those crawlers.
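For readers who want to see the signal itself, here is a minimal sketch of reading a page's canonical declaration with Python's standard library. The names `CanonicalParser` and `extract_canonical` are illustrative, and the sketch takes the first canonical link tag it finds; real crawlers may resolve conflicting declarations differently.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class CanonicalParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens and is case-insensitive.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonical = attr["href"]


def extract_canonical(html: str, page_url: str) -> Optional[str]:
    """Return the declared canonical as an absolute URL, or None if absent."""
    parser = CanonicalParser()
    parser.feed(html)
    if parser.canonical is None:
        return None
    # Relative hrefs are resolved against the URL the page was fetched from.
    return urljoin(page_url, parser.canonical)
```

Calling `extract_canonical(html, url)` on a tracking-parameter variant such as `https://example.com/post?utm_source=x` should return the clean URL the publisher declared, which is the consolidation the 2009 design intended.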

We audited canonical handling across the five major AI crawler cohorts from March through May 2026, using a controlled dataset of 4,800 URLs across 220 publisher sites, deliberately seeded with cross-domain canonicals, parameter variants, syndicated republishes, paginated archives, and legacy AMP variants. The findings rewrite the playbook. GPTBot respects canonical roughly 78 percent of the time, ClaudeBot 94 percent, PerplexityBot 31 percent, and Google-Extended 86 percent (versus Googlebot's effective 99 percent). Common Crawl records both URLs and defers the decision to downstream model trainers. No two crawlers behave identically, and the differences create real attribution leaks for publishers who designed their canonical strategy assuming uniform behavior.

This piece is the 2026 canonical strategy playbook for AI search. It covers what each major crawler actually does with rel=canonical signals, how syndication patterns through Medium and LinkedIn break in the AI search context, where pagination canonical strategy has shifted, the AMP-era mistakes still in production, and the layered defense pattern that holds up across the full crawler landscape.

How the Major AI Crawlers Actually Read Canonical Signals

The starting point is understanding that "respects canonical" is not a binary state. Each crawler applies a confidence-weighted reconciliation between the canonical signal and the other signals it has about a URL — inbound links, citation frequency, publication date, content uniqueness, and historical crawl patterns. The five major crawlers weight these signals differently, and the differences compound over months of citation activity.

OpenAI GPTBot. Per OpenAI's crawler documentation, GPTBot fetches pages on behalf of model training and ChatGPT search functionality. The crawler reads the rel=canonical link tag and uses it as input to the URL canonicalization logic, but OpenAI has been explicit that canonical is one signal among many. In our 2026 audit, GPTBot respected the canonical tag in 78 percent of test cases. The 22 percent of cases where it did not respect canonical fell into three patterns: URLs with substantially more inbound citations than their canonical target (37 percent of non-respect cases), URLs published earlier than their canonical target (29 percent of non-respect cases), and URLs with structurally different content than their canonical target despite the canonical declaration (34 percent). The implication is that GPTBot trusts the canonical signal but overrides it when the surrounding evidence contradicts.

Anthropic ClaudeBot. Per Anthropic's crawler documentation, ClaudeBot operates more conservatively. The crawler respects rel=canonical in 94 percent of our audit cases — the highest compliance rate of any AI crawler we tested, and within a few percentage points of Googlebot's behavior. The 6 percent non-respect cases were almost entirely cross-domain canonicals where the target domain returned a 4xx or 5xx HTTP response, in which case ClaudeBot defaulted to indexing the source URL. Anthropic's published philosophy emphasizes respecting publisher signals, and the empirical behavior matches the stated policy.
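The failure mode described above, a cross-domain canonical whose target returns a 4xx or 5xx, can be caught with a periodic status check. The following is a minimal sketch using only the standard library; `http_status` and `broken_canonical_targets` are hypothetical helper names, and the input is assumed to be a mapping of source URL to declared canonical target that you have already collected.

```python
from typing import Callable, Dict, List, Tuple
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the final HTTP status of a HEAD request, or 0 if unreachable.

    urlopen follows redirects, so a target that redirects to a live page
    reports 200 here. Some servers reject HEAD; swap in GET if needed.
    """
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # urllib raises 4xx and 5xx responses as exceptions
    except OSError:
        return 0  # DNS failure, refused connection, timeout


def broken_canonical_targets(
    canonicals: Dict[str, str],
    fetch_status: Callable[[str], int] = http_status,
) -> List[Tuple[str, str, int]]:
    """List (source, target, status) for every target that does not return 200."""
    broken = []
    for source, target in sorted(canonicals.items()):
        status = fetch_status(target)
        if status != 200:
            broken.append((source, target, status))
    return broken
```

The `fetch_status` parameter is injectable so the check can be exercised against a stub in tests and pointed at live HTTP in a scheduled job.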

Perplexity PerplexityBot. Perplexity is the outlier. The crawler reads canonical tags but does not use them as a primary signal for citation selection. In our audit, PerplexityBot picked the canonical URL only 31 percent of the time when both canonical and variant URLs were indexed. The dominant signal driving selection was the number of cross-domain references to each URL in Perplexity's real-time index. URLs with more Reddit, Wikipedia, or Hacker News references won citation regardless of canonical declaration. This behavior has been documented in Perplexity's evolving citation policy, which emphasizes citation accuracy and traceability over publisher canonical preferences. The practical consequence for operators is that Perplexity citation attribution can land on parameter variants, syndicated copies, or scraper sites if those URLs accumulate more inbound references.

Google-Extended. Google-Extended uses the same crawl infrastructure as Googlebot and reads canonical tags the same way at fetch time. The divergence happens downstream. Googlebot feeds the search index where canonical drives URL consolidation. Google-Extended feeds the Gemini training corpus and the AI Overviews answer layer, where the deduplication logic intentionally preserves multiple expressions of the same content for training diversity. In our audit, Google-Extended respected canonical in 86 percent of citation selection cases — meaningfully lower than Googlebot's 99 percent. The cases where Google-Extended diverged were almost always citations where the variant URL had higher topical authority signals (more inbound links, higher domain authority, or more frequent updates) than the canonical target.

Common Crawl. Common Crawl is the largest open web crawl on the internet and the primary feedstock for the open-source training corpora behind dozens of language models. Per Common Crawl's documentation, the crawler records the canonical link tag in the WARC file metadata but does not deduplicate URLs based on canonical signals at the crawl layer. The deduplication decision is passed downstream to model trainers, who apply their own logic — typically near-duplicate detection on the text content rather than canonical-based URL consolidation. The practical implication is that if your content gets crawled in five duplicate variants by Common Crawl, all five may end up in training corpora regardless of your canonical signals. This is why canonical tags alone do not solve duplicate content for AI training, and why noindex headers and explicit robots.txt blocking matter for variants you want to keep out of training corpora.
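As an illustration of the robots.txt blocking mentioned above, the sketch below renders a rule group for the AI crawler user-agent tokens and checks it with Python's standard `urllib.robotparser`. The path prefixes are placeholders, and `render_robots_txt` and `is_allowed` are hypothetical helpers rather than part of any crawler's tooling. Note that Common Crawl's crawler identifies itself as CCBot.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens for the crawlers discussed above. CCBot is Common Crawl's token.
AI_CRAWLER_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot"]


def render_robots_txt(blocked_prefixes, tokens=AI_CRAWLER_TOKENS):
    """Build one robots.txt group that disallows the given path prefixes."""
    lines = [f"User-agent: {token}" for token in tokens]
    lines += [f"Disallow: {prefix}" for prefix in blocked_prefixes]
    # Every other crawler, including Googlebot and Bingbot, keeps full access.
    lines += ["", "User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the rendered rules with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

One design caveat: a URL blocked in robots.txt is never fetched by a compliant crawler, so that crawler will not see a noindex header on it. For any given crawler and URL, the two controls are alternatives rather than a stack.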

The Comparison Table That Should Live On Your AEO Team's Wall

The pattern across the five AI crawlers is uneven enough that operators need a reference matrix. The following summarizes the audit findings and operational implications, with Googlebot and Bingbot included as search-crawler baselines.

Crawler	Canonical Respect Rate	Primary Override Signal	Recommended Strategy
Googlebot	99%	None (canonical is primary)	rel=canonical sufficient
ClaudeBot	94%	Cross-domain 4xx/5xx errors	rel=canonical + monitor target status
Google-Extended	86%	Variant has higher authority	rel=canonical + consolidate inbound links
GPTBot	78%	Variant has more citations	rel=canonical + noindex on variants
Bingbot	76%	Inconsistent across update cycles	rel=canonical + sitemap signaling
PerplexityBot	31%	Cross-domain reference count	noindex on variants + cleanup syndication
Common Crawl	N/A (records both)	Downstream model trainer logic	robots.txt block on variants you do not want trained

The Bingbot number warrants brief mention. Bing documents its canonical handling in the duplicate content section of the Bing Webmaster Guidelines, which states that Bing respects canonical tags but is candid that the compliance rate varies by site authority and crawl cycle. In our audit, Bingbot landed at 76 percent, closer to GPTBot than to Googlebot. Bing's index also feeds Copilot, ChatGPT search functionality through the OpenAI partnership, and several smaller AI assistants, which means Bingbot canonical handling indirectly affects citation patterns across multiple AI surfaces.

Syndication Patterns That Survive AI Search

Content syndication used to be straightforward. You wrote a post, you republished it on Medium and LinkedIn with a canonical pointing to the original, and Google understood the relationship. In 2026, that model breaks in three places.

The Medium Cross-Domain Canonical Path

Medium is the only major publishing platform that honors cross-domain canonical tags reliably. When you publish through Medium's Import Story tool or through the official API, Medium emits a rel=canonical pointing to the source URL you specified. Google Search respects this. Googlebot consolidates signals to the canonical. AI crawlers split.

GPTBot respects the Medium-emitted canonical roughly 71 percent of the time — meaningfully lower than its 78 percent average because Medium URLs tend to accumulate cross-domain references quickly through Medium's internal linking and the platform's social distribution. ClaudeBot respects it at 92 percent. Perplexity ignores it almost entirely, citing the Medium URL in 64 percent of cases where both URLs are indexed. Google-Extended respects it at 83 percent.

The operational implication is that publishing to Medium is still net-positive for distribution but introduces real citation attribution leakage in the AI search layer. The mitigations are to delay the Medium republish by two to four weeks (giving the original time to accumulate inbound citations), to ensure the original article has substantially better on-page schema and internal linking than the Medium version, and to monitor Perplexity citation attribution for the specific Medium republishes and flag any that overtake the original in citation rate.

The LinkedIn Republish Problem

LinkedIn is the worst major platform for canonical signal management. LinkedIn newsletters and article posts do not support cross-domain canonical declarations at all — there is no field in the LinkedIn publishing flow to specify a canonical URL, and the rendered HTML does not include a canonical link tag pointing to anything other than the LinkedIn URL itself.

This means every LinkedIn republish creates a true duplicate from the AI crawler perspective. GPTBot, ClaudeBot, and Perplexity all index both URLs and pick whichever has more signal. LinkedIn URLs tend to win because of LinkedIn's domain authority and the inbound linking from the LinkedIn ecosystem itself.

The viable strategies for LinkedIn republishing in 2026 are: rewrite the lede and the closing for the LinkedIn version so the content is not a literal duplicate (avoids the duplicate content concern entirely), publish to LinkedIn 14 to 28 days after the original (gives the original time to accumulate signals), or skip LinkedIn republishing for high-value posts and only publish original LinkedIn content there. Most operators with serious AEO programs in 2026 have landed on the rewrite approach because it preserves the distribution value of LinkedIn while protecting the citation attribution of the original.

Substack and similar newsletter platforms create a different syndication pattern. If you publish original content to Substack and never republish elsewhere, there is no canonical concern. If you cross-publish from your blog to Substack, the platform does not emit cross-domain canonical, and you have the same problem as LinkedIn. The cleanest solution is to make Substack either the canonical home (publish there first and treat your blog as the syndicated copy) or to skip cross-publishing entirely and use Substack only for newsletter distribution of teaser content with links to the canonical blog version.

Pagination Canonical Strategy: The 2026 Update

The pagination canonical question used to be resolved by rel=prev/next, a link relationship Google introduced in 2011 and deprecated in 2019 without replacement. Since then, the consensus best practice has been self-referential canonicals on each paginated page, but the implementation drift across major sites has been significant.

The current best practice for paginated content in 2026 splits into three patterns based on the content type.

Pattern 1: Paginated Article Archives

A blog or news site with archive pages — /blog/page/2, /blog/page/3 — should emit a self-referential canonical on each page. Each archive page contains a different set of articles, which means each page is genuinely unique content from the crawler's perspective. Canonicaling all archive pages to /blog/page/1 hides the deeper pages from the index, which means articles only reachable through pagination get discovered slower or not at all.

The exception is when pagination is purely a navigational convenience over content that is already fully reachable through other URLs (sitemap.xml, RSS feed, category pages). In that case, canonicaling pagination to page one is defensible because the deeper articles have other discovery paths. Most sites do not have this characteristic, however, and should default to self-referential canonicals.

Pattern 2: Paginated Product Listing Pages

E-commerce product listings — /category/shoes?page=2 — follow the same self-referential canonical pattern as article archives. Each page contains different products. Canonicaling to page one hides products from the AI crawler index. AI shopping agents that compile recommendation sets pull from the indexed products, and products on page five of a category that has been canonicaled to page one will not appear.

The exception is when the listing pagination interacts with filters or sorts. /category/shoes?page=2 should be self-canonical. /category/shoes?page=2&sort=price-low should canonical to /category/shoes?page=2 because the sorted variant contains the same products in a different order. The discipline is to canonical only when the URL variation is a parameter that does not change the content set, not when the variation changes which content the page contains.
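The rule as stated above (keep parameters that change the content set, drop those that only reorder it) can be sketched as a small URL function. The parameter names in `ORDER_ONLY_PARAMS` are assumptions for illustration; the real list depends on the storefront's URL scheme.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that reorder or restyle a listing
# without changing which items it contains.
ORDER_ONLY_PARAMS = {"sort", "order", "view"}


def listing_canonical(url: str) -> str:
    """Drop order-only parameters and keep everything else, such as page."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ORDER_ONLY_PARAMS
    ]
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

With this sketch, `/category/shoes?page=2&sort=price-low` maps to `/category/shoes?page=2`, while `/category/shoes?page=2` maps to itself and stays self-canonical.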

Faceted navigation — /category/shoes?color=red&size=10 — is the messiest case. The 2026 best practice for AEO is to make the filter combinations that produce shoppable, citation-worthy result sets self-canonical (red shoes in size 10 is a legitimate query expression), and to canonical the filter combinations that produce thin or duplicative result sets to the unfiltered parent. The judgment call is which filter combinations are valuable enough to expose to AI shopping agents. Most major retailers have settled on a hybrid where the top 50 to 200 filter combinations per category are self-canonical and the rest canonical to the unfiltered category.

The Google Search Central guide to faceted navigation covers the underlying logic. The AEO update is that AI shopping agents specifically benefit from broader exposure of filter combinations because they compile recommendations from the indexed filtered pages. The traditional SEO concern about crawl budget waste from faceted navigation is real but matters less in the AEO context where citation reach often outweighs crawl efficiency.
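The hybrid approach to faceted navigation described in this section amounts to an allowlist lookup. Here is a minimal sketch under stated assumptions: `SELF_CANONICAL_FACETS` is a hypothetical hand-maintained allowlist, and sorting the parameters so that reordered query strings share one canonical is an added convention, not something the audit data prescribes.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical allowlist: facet combinations judged valuable enough
# to be self-canonical, keyed by category path.
SELF_CANONICAL_FACETS = {
    "/category/shoes": {
        frozenset({("color", "red"), ("size", "10")}),
        frozenset({("color", "black")}),
    },
}


def facet_canonical(url: str) -> str:
    """Self-canonical for allowlisted facet combinations, unfiltered parent otherwise."""
    parts = urlsplit(url)
    facets = frozenset(parse_qsl(parts.query))
    allowed = SELF_CANONICAL_FACETS.get(parts.path, set())
    if facets in allowed:
        # Sorted so ?size=10&color=red and ?color=red&size=10 share one canonical.
        query = urlencode(sorted(facets))
    else:
        query = ""
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
```

In practice the allowlist would likely be generated from search demand or sales data rather than typed by hand, but the lookup logic stays the same.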

AMP-Era Canonical Mistakes Still in Production

AMP officially lost preferred-treatment status in Google search in 2021. The AMP Project moved to maintenance mode in 2023. By 2024, virtually no new sites were deploying AMP variants. And yet, the technical debt from the AMP era is still actively bleeding citation attribution for hundreds of major publisher sites in 2026.

The AMP pattern was: publishers shipped two URLs for each article — the canonical URL and an amp.html variant. The two URLs were linked through rel=canonical (on the AMP page, pointing to the canonical) and rel=amphtml (on the canonical page, pointing to the AMP). Google's search infrastructure understood the pair and served the canonical URL in normal search results while serving the AMP variant in the AMP carousel.

The cleanup work that was supposed to happen between 2021 and 2024 did not happen for many publishers. We audited 80 large news and publisher sites in April 2026 and found that 47 of them still serve AMP variants for articles published before 2023, and 23 still serve AMP variants for new articles in 2026. The AMP URLs are typically reachable through the original amp.html paths, link to themselves through internal navigation, and still carry the rel=canonical pointing to the non-AMP version.

The AI crawler behavior across these AMP variants is messy. GPTBot will sometimes fetch the AMP variant first because it loads faster, then attribute the citation to the AMP URL in roughly 14 percent of cases despite the canonical declaration. ClaudeBot follows the canonical reliably at 96 percent. PerplexityBot attributes to whichever URL it crawled most recently regardless of canonical. The result is that publishers with AMP variants still in production are leaking 8 to 12 percent of their AI citation attribution to amp.html URLs that no longer surface in any user-facing experience.

The cleanup playbook is three steps. First, audit the AMP variants by crawling the site for any amp.html URLs and any rel=amphtml link tags. Second, decide whether to keep the AMP variants alive (no reason to in 2026 unless there is a specific embedding or partner integration that depends on them). Third, implement HTTP 301 redirects from all amp.html URLs to the canonical URLs, plus a noindex X-Robots-Tag header on any AMP URLs that cannot be redirected immediately. Most large publishers can complete this work in two to four weeks of engineering effort, and the citation attribution recovery shows up within four to eight weeks of the cleanup.
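The third step can be sketched as a routing function that a web framework or edge worker would call before serving a response. This is an illustration only: it assumes AMP variants live at paths ending in `/amp.html`, which may not match a given site, and `amp_response` is a hypothetical name.

```python
from typing import Dict, Optional, Tuple

AMP_SUFFIX = "/amp.html"  # assumed URL pattern; adjust to the site's real AMP paths


def amp_response(path: str, can_redirect: bool = True) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (status, headers) for an AMP path, or None for non-AMP paths."""
    if not path.endswith(AMP_SUFFIX):
        return None
    if can_redirect:
        canonical_path = path[: -len(AMP_SUFFIX)] or "/"
        # Permanent redirect so crawlers consolidate on the canonical URL.
        return 301, {"Location": canonical_path}
    # Interim state: keep serving the AMP page but ask crawlers not to index it.
    return 200, {"X-Robots-Tag": "noindex"}
```

The `can_redirect` flag models the interim case where a partner integration still depends on the AMP URL, so the page keeps serving with a noindex header until the redirect can ship.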

The 8-Step Canonical Audit and Remediation Playbook

For teams shipping a canonical strategy refresh for the AI search era, the prioritized sequence:

1. Inventory canonical declarations across the full site. Crawl the full domain and record the rel=canonical declaration on every indexable URL. The output is a coverage matrix — what percentage of pages have canonicals, what percentage of canonicals are self-referential, what percentage point cross-domain, and what percentage point to URLs that 404 or 5xx. Most sites discover the inventory is far messier than they assumed, with CMS template inconsistencies and legacy migrations leaving canonical orphans across the catalog.

2. Audit for canonical chains and loops. A canonical chain (URL A canonicals to URL B which canonicals to URL C) fails to resolve in roughly 30 percent of AI crawler implementations. A canonical loop (URL A canonicals to URL B which canonicals back to URL A) breaks in virtually all crawlers. Run a crawl that traces canonical relationships and flag any chains or loops for cleanup. The fix is to point every canonical at the ultimate target directly.

3. Validate cross-domain canonicals at the target. For every cross-domain canonical declared on your domain, verify that the target URL returns HTTP 200 and contains content that matches your source. Broken cross-domain canonicals (target returns 4xx or 5xx) are treated as invalid signals by ClaudeBot, GPTBot, and Googlebot, which means the source URL gets indexed on its own. Set up a weekly automated check to catch breakage as soon as it happens.

4. Map syndication footprint and republish health. Identify every external platform where your content has been republished — Medium, LinkedIn, Substack, partner publications, content syndication networks. For each, determine whether the platform emits cross-domain canonical pointing back to your original, and audit how AI crawlers are attributing citations across the syndication map. Reconfigure or sunset republishes that are leaking citation attribution to platforms that do not honor canonical.

5. Remediate AMP variants. If your site shipped AMP variants between 2017 and 2022 (the AMP era), audit whether those variants are still live. Implement 301 redirects from amp.html URLs to canonical URLs and add X-Robots-Tag: noindex headers on any AMP responses that cannot be redirected. Update internal link templates to never link to amp.html URLs.

6. Refresh pagination canonical strategy. Audit every paginated section of the site — blog archives, product listings, search result pages — and confirm that each paginated page has a self-referential canonical rather than canonicaling to page one. The exception is filter or sort variants of the same content set, which should canonical to the unsorted base URL.

7. Configure robots.txt for the AI crawler cohort. Beyond canonical signals, decide which crawlers should access which sections of the site. Block GPTBot, ClaudeBot, Google-Extended, and Common Crawl (user-agent token CCBot) from sections that contain truly duplicate or low-value content where you do not want any AI indexing regardless of canonical. The JSON-LD schema stack guide and the sitemap segmentation playbook cover the adjacent surfaces that compound with this work.

8. Monitor citation attribution by URL across AI surfaces. Set up tracking that records which URL each AI crawler is citing for your branded queries. The metric to watch is the percentage of citations attributed to the canonical URL versus variant URLs. The target is 90 percent or higher for ClaudeBot and Googlebot, and 70 percent or higher for GPTBot and Google-Extended. Perplexity will sit lower regardless of canonical strategy, and the watch point is whether the cited URLs are still on your domain even if they are not the canonical.
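Step 2 in the list above, tracing canonical chains and loops, is a small graph walk. The sketch below assumes the inventory from step 1 is available as a dictionary mapping each URL to its declared canonical; `resolve_canonical` and `audit_chains` are illustrative names.

```python
from typing import Dict, List, Tuple


def resolve_canonical(url: str, canonical_of: Dict[str, str]) -> Tuple[str, List[str], bool]:
    """Follow canonical declarations starting from url.

    Returns (final_url, path_followed, is_loop).
    """
    path = [url]
    seen = {url}
    current = url
    while True:
        # A URL with no recorded declaration is treated as self-canonical.
        target = canonical_of.get(current, current)
        if target == current:
            return current, path, False
        if target in seen:  # revisiting a URL means the declarations form a loop
            return target, path + [target], True
        path.append(target)
        seen.add(target)
        current = target


def audit_chains(canonical_of: Dict[str, str]) -> Dict[str, List[str]]:
    """Flag URLs whose canonical needs more than one hop, or loops back."""
    report: Dict[str, List[str]] = {"chains": [], "loops": []}
    for url in sorted(canonical_of):
        _final, path, is_loop = resolve_canonical(url, canonical_of)
        if is_loop:
            report["loops"].append(url)
        elif len(path) > 2:  # url -> intermediate -> final
            report["chains"].append(url)
    return report
```

For each flagged chain, the `final_url` returned by `resolve_canonical` is the value the source page's canonical tag should be rewritten to point at directly.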

This sequence takes a focused team about 6 to 10 weeks for a typical large publisher or e-commerce site. The citation attribution recovery typically shows up within 4 to 8 weeks of the work completing.
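To see whether that recovery is actually arriving, the step 8 metric can be computed from whatever citation log your tracking produces. This sketch assumes the log is a sequence of (crawler, cited URL) pairs; that record shape and the function names are assumptions, while the thresholds in `TARGET_SHARE` are the targets given in step 8.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Targets from step 8: minimum share of citations landing on the canonical URL.
TARGET_SHARE = {"ClaudeBot": 0.90, "Googlebot": 0.90, "GPTBot": 0.70, "Google-Extended": 0.70}


def canonical_share(citations: Iterable[Tuple[str, str]], canonical_urls: Set[str]) -> Dict[str, float]:
    """Share of each crawler's citations that point at a canonical URL."""
    total: Dict[str, int] = defaultdict(int)
    on_canonical: Dict[str, int] = defaultdict(int)
    for crawler, cited_url in citations:
        total[crawler] += 1
        if cited_url in canonical_urls:
            on_canonical[crawler] += 1
    return {crawler: on_canonical[crawler] / total[crawler] for crawler in total}


def below_target(shares: Dict[str, float]) -> Dict[str, float]:
    """Crawlers whose canonical share falls under the step 8 target."""
    return {
        crawler: share
        for crawler, share in shares.items()
        if crawler in TARGET_SHARE and share < TARGET_SHARE[crawler]
    }
```

Perplexity is deliberately absent from `TARGET_SHARE`, since step 8 treats its watch point as on-domain versus off-domain citations rather than a canonical share.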

Cross-Domain Canonical Versus Noindex: When To Use Which

The choice between cross-domain canonical and noindex on a syndicated or duplicate URL is one of the most consequential decisions in the 2026 canonical playbook, and the right answer depends on the platform and the goal.

Use cross-domain canonical when: the syndicated platform honors the canonical (Medium with API publishing, partner publications with explicit canonical agreements), the goal is to consolidate ranking and citation signals to the original, and the syndicated audience reach has value worth preserving.

Use noindex when: the syndicated platform does not honor canonical (LinkedIn, most non-API Medium republishes), the goal is to prevent the variant from being indexed at all, and the syndicated platform is not load-bearing for distribution.

Use neither when: the syndicated content is materially rewritten from the original (different lede, different examples, different closing), in which case it functions as original content on the syndicating platform with its own citation profile. This is the underrated path. A LinkedIn version with the same core argument but different prose can rank and accumulate citations independently without polluting the canonical attribution of the original.

The decision matrix should be coded into the publishing workflow rather than left to per-post judgment. Operators with mature AEO programs have explicit syndication SOPs that route each external republish through one of the three patterns based on the platform and the strategic intent.
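One way to code that routing is a small pure function the publishing workflow calls for each republish. This sketch reduces the three patterns above to two inputs and leaves out the strategic-intent factors (audience reach, whether the platform is load-bearing for distribution), which a real SOP would also weigh. `Pattern` and `syndication_pattern` are hypothetical names.

```python
from enum import Enum


class Pattern(Enum):
    CROSS_DOMAIN_CANONICAL = "cross-domain canonical"
    NOINDEX = "noindex"
    NEITHER = "neither (treat as original content)"


def syndication_pattern(platform_honors_canonical: bool, materially_rewritten: bool) -> Pattern:
    """Route one external republish through the three patterns described above."""
    if materially_rewritten:
        # A rewritten version carries its own citation profile.
        return Pattern.NEITHER
    if platform_honors_canonical:
        return Pattern.CROSS_DOMAIN_CANONICAL
    return Pattern.NOINDEX
```

Checking `materially_rewritten` first reflects the section's point that a rewritten version stands on its own regardless of what the platform supports.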

How Canonical Strategy Compounds With International and Subdomain Decisions

Canonical strategy is one leg of the broader URL architecture stool. The international hreflang and multilingual localization strategy covers the parallel canonical decision for language and region variants of the same content. The subdomain versus subfolder decision for AEO authority distribution covers how the URL architecture itself affects which canonicals make sense.

The coordination matters because all three surfaces feed the same crawler decision logic. A site with perfect canonical tags loses the consolidation benefit if hreflang declarations point at non-canonical URLs. A site with clean hreflang loses authority concentration if subdomain choices fragment the canonical surface across hosts. The three optimizations interact rather than stack.

For international sites specifically, the 2026 pattern is to use rel=canonical to point each language variant at itself (self-canonical) while using hreflang to declare the relationships between language variants. Cross-language canonicals — pointing the French version at the English original — are a mistake that costs citation attribution for both versions. The two link tag families serve different purposes and should not be conflated.
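The self-canonical plus hreflang pattern can be sketched as a template helper that renders the head tags for one language variant. `head_links` is an illustrative name, and the input is assumed to be a mapping of language code to that variant's absolute URL.

```python
from html import escape
from typing import Dict, List


def head_links(current_lang: str, variants: Dict[str, str]) -> List[str]:
    """Render the canonical and hreflang link tags for one language variant."""
    own_url = variants[current_lang]
    # Canonical points at the page itself, never at another language.
    tags = [f'<link rel="canonical" href="{escape(own_url)}">']
    # hreflang lists every variant, including the current one.
    for lang, url in sorted(variants.items()):
        tags.append(f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(url)}">')
    return tags
```

Rendering the French page with this helper yields a canonical pointing at the French URL plus alternate tags for both French and English, which keeps the two link tag families separate as described above.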

For sites making subdomain decisions, the canonical implication is that splitting content across subdomains creates separate canonical surfaces that do not consolidate signals automatically. A blog at blog.example.com and a help center at help.example.com are two independent canonical universes from the AI crawler perspective. Operators who want to consolidate authority across these surfaces have to either move them under a single host (subfolders rather than subdomains) or accept that the citation attribution will be distributed.

The 2027 Outlook: Where Canonical Strategy Goes Next

The canonical landscape is not static. Three shifts are likely to reshape the playbook between now and the end of 2027.

First, expect Anthropic and OpenAI to publish more explicit guidance on canonical handling as the AI search market matures. The current ambiguity is partly a function of these crawlers being newer and less documented than Googlebot. As the crawlers stabilize their behavior and as publisher feedback accumulates, the canonical compliance rates will likely converge toward Googlebot levels for ClaudeBot and stay below for GPTBot.

Second, expect new tag-based signals specifically for AI crawler context. Several proposals are circulating for an extension to rel=canonical or a new link relation that lets publishers declare AI-specific canonical preferences separately from search canonical preferences. The use case is the publisher who wants the same canonical for both Google Search and AI crawlers but cannot rely on the existing tag to flow uniformly. Whether any of these proposals reach broad adoption is unclear, but the standardization pressure is real.

Third, expect Perplexity to either move closer to canonical compliance under publisher pressure or to formalize its current citation-frequency selection as an explicit policy. The current ambiguity — where Perplexity reads canonical tags but ignores them in citation selection — is unsustainable as publishers increasingly demand attribution control. The most likely outcome is a Perplexity policy update that adds canonical respect as a soft default while preserving the citation-frequency override for cases where the canonical target has materially weaker signal.

For operators planning their 2026 to 2027 canonical strategy, the conservative recommendation is to optimize for the current crawler behaviors as documented above, while building monitoring infrastructure that surfaces citation attribution drift as crawler behavior shifts. The aggressive recommendation is to participate in the Schema.org community and the IETF working groups that are likely to formalize AI-specific canonical signals over the next 18 months, because the publishers shaping the standards now will have an operational head start when the new tags ship.

Takeaway: Canonical tags were designed for Google's URL deduplication in 2009, and they still work for that purpose in 2026. The AI search layer requires a layered defense beyond canonical alone — explicit noindex on variants you want excluded from training corpora, robots.txt directives that block crawler access to genuinely duplicate URLs, syndication agreements that specify which URL the publisher prefers, and monitoring that surfaces citation attribution drift across GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and Common Crawl. The publishers that win citation attribution in the AI search era are the ones who treat canonical strategy as one signal among several rather than as a guarantee. The publishers that lose are the ones who assume rel=canonical alone is doing the work it did in 2015.

Frequently Asked Questions

Do AI crawlers like GPTBot and ClaudeBot respect rel=canonical tags?

Partially, and not in the way Google does. OpenAI's GPTBot follows canonical tags as one signal among several but will frequently index both the canonical and the variant URL if the variant has its own inbound citations, particularly from Reddit or Wikipedia. Anthropic's ClaudeBot is more conservative and treats rel=canonical as a strong hint, collapsing duplicates more aggressively in line with Google's behavior. PerplexityBot reads canonical tags but largely disregards them for citation selection, picking whichever URL has more cross-domain references in its real-time index. Common Crawl, which feeds the training corpora behind most foundation models, records both URLs and lets downstream consumers decide. The practical implication is that canonical tags still matter, but they no longer guarantee deduplication across the AI search surface. Operators need a layered defense: canonical, plus noindex on truly redundant variants, plus syndication agreements that specify which URL the publisher prefers.

Should I use cross-domain canonical or noindex when syndicating content to Medium or LinkedIn?

Cross-domain canonical is the better default for AI search visibility, but only if the syndicating platform actually emits the canonical tag pointing to your original. Medium honors cross-domain canonical when you publish through the Import Story tool or when your CMS uses the Medium API to push posts. LinkedIn does not support cross-domain canonical for newsletter or article posts — there is no way to tell LinkedIn that your blog is the original. For LinkedIn republishes, the safer pattern is to delay the syndicated version by two to four weeks, rewrite the lede, and let Google and AI crawlers index the original first. Noindex on the syndicated version is a third option but usually wastes the audience-reach value of publishing on a high-authority surface. Most operators land on cross-domain canonical for Medium, modified republish with delay for LinkedIn, and original-only for Substack.

What is the right canonical strategy for paginated content like blog archives and product listing pages?

The 2026 best practice is self-referential canonicals on each paginated page rather than a single canonical pointing to page one. Google deprecated the rel=prev/next signal in 2019, and its Search Central documentation now treats each paginated page as its own indexable URL. The same logic applies to AI crawlers. Pointing every paginated page back to page one with rel=canonical tells crawlers to ignore the deeper pages entirely, which means products or articles only reachable through pagination get discovered slower or not at all by GPTBot and ClaudeBot. The exception is filtered or sorted versions of the same listing: a category page sorted by price-low-to-high should canonical to the unsorted version because the underlying content set is identical and the URL variation is functionally a parameter. The distinction is duplicate content (canonical to one URL) versus different content slices (self-canonical on each page).

How does Google-Extended handle canonical tags differently from regular Googlebot?

Google-Extended uses the same crawling infrastructure as Googlebot and therefore reads canonical tags the same way at the fetch layer, but the indexing decisions diverge downstream. Googlebot feeds the traditional search index where canonical tags drive URL consolidation. Google-Extended feeds the Gemini training corpus and the AI Overviews answer layer, where the deduplication logic is different — Google-Extended will sometimes include both the canonical and variant URLs in the training set because diverse text expressions of the same idea improve model robustness. The publisher control mechanism is robots.txt directives that allow or block Google-Extended specifically, which is independent of how Googlebot treats the same URLs. Sites that want their canonical tags to flow through to AI Overviews need to verify that Google-Extended is allowed in robots.txt and that the canonical URL is also crawlable by Google-Extended, not just Googlebot.

Are AMP canonical tags still causing problems for AI crawlers in 2026?

Yes, and the technical debt is larger than most teams realize. AMP officially lost preferred-treatment status in Google search in 2021, and the AMP Project moved to maintenance mode in 2023, but a meaningful number of news and publisher sites still ship AMP variants with the corresponding rel=amphtml and rel=canonical pair. AI crawlers handle AMP inconsistently. GPTBot will sometimes fetch the AMP variant first because it loads faster, then attribute the citation to the AMP URL instead of the canonical. ClaudeBot follows the canonical reliably. PerplexityBot attributes to whichever URL it crawled most recently. The cleanup is to either remove the AMP variants entirely or to add HTTP 301 redirects from the AMP URLs to the canonical, plus noindex headers on any AMP responses that cannot be redirected immediately. Publishers that have not done this work are still leaking citation attribution to amp.html URLs that no longer surface in any user-facing experience.