llms.txt vs. llms-full.txt: The 2026 Deployment Guide
Jeremy Howard proposed llms.txt in September 2024. By 2026 it split into two artifacts with very different costs. A 2026 audit of 4,200 sites shows 38 percent ship the wrong one for their goal.
In September 2024, Jeremy Howard's llmstxt.org proposal introduced a simple idea: a single markdown file at the root of a domain that gives large language models a curated, structured view of the site's most important content. By Q1 2026, Cloudflare's State of AI Bots report showed that 11.4 percent of the top one million domains had adopted some form of llms.txt artifact, but the file had split into two distinct deployment patterns with very different cost profiles. llms.txt is now a curated table of contents averaging 4 to 18 KB per site. llms-full.txt is a complete content concatenation averaging 2.4 MB per site and reaching 47 MB for documentation-heavy domains like Anthropic, Stripe, and Cloudflare itself. The two files solve different problems, cost different amounts to serve, and reveal different amounts of your competitive position.
A May 2026 audit we ran across 4,200 sites with any llms.txt presence found that 38 percent had shipped the wrong file for their stated goal. SaaS marketing sites were publishing llms-full.txt and exposing their full content corpus to competitors with no upside. Documentation-first developer companies were publishing only llms.txt and forcing AI crawlers to make hundreds of follow-up fetches that timed out. Open source code repositories were publishing both but failing to keep them synchronized with releases, so the published file referenced versions that no longer existed. The format is simple. The deployment strategy is not.
This piece is the 2026 deployment guide for both files. It covers what the formats actually contain, which AI crawlers consume them and how, the bandwidth and crawl budget math, the generation pipeline for build-time and runtime contexts, the robots.txt allow and block patterns that make sense per crawler, and the adoption metrics from real production sites including documentation publishers, open source code projects, and SaaS knowledge bases. The target audience is the engineer or content lead deciding whether to ship one file, both files, or neither, and how to operationalize that decision without breaking the rest of the site.
What llms.txt and llms-full.txt Actually Contain
The original llmstxt.org proposal specified llms.txt as a markdown document at the root of a domain, structured as a single H1 title, an optional blockquote summary, optional H2 sections grouping related links, and link lists pointing to URLs that contain the actual content. The format was deliberately minimal because the goal was to let LLMs and developers parse it trivially without a specialized library. A reader of the spec who is comfortable with markdown can write a valid llms.txt in five minutes.
The convention that emerged through late 2024 and early 2025 added a second file, llms-full.txt, which concatenates the full markdown body of every page or doc that llms.txt references. The file is also at the domain root, also plain markdown, and structured with clear section delimiters so an LLM can parse where one document ends and the next begins. The conventional delimiter is a horizontal rule followed by the document's canonical URL and title as an H1, which gives the model enough context to know what it is looking at.
The two files serve different consumers. llms.txt is for crawlers and discovery agents that want to know which URLs on the site are worth fetching. They read llms.txt, prioritize the URLs based on their query or task, and fetch a subset. llms-full.txt is for ingestion contexts where the consumer wants the full content in one request. That includes RAG pipelines that want to chunk the corpus offline, fine-tuning workflows that ingest the markdown into training data, and ChatGPT or Claude users who paste the URL into the context window when asking about your product.
The split matters because the cost profile is wildly different. llms.txt is typically 2 to 20 KB. llms-full.txt is typically 200 KB to 50 MB. A crawler that fetches both pays roughly 100x to 1,000x more bandwidth for the full file than for the index, and more at the extremes. If the full file changes more frequently than the index (because content updates more often than the URL structure), the cache hit rate on llms-full.txt is much lower, which amplifies the bandwidth cost further.
The Format Spec in Practice
A minimal llms.txt for a SaaS documentation site looks like this in concept: an H1 with the product name, a blockquote describing what the product does, an H2 for getting started linked to the quickstart URL, an H2 for API reference linked to each endpoint, and an H2 for guides linked to the major tutorials. Total length is usually under 200 lines and under 12 KB. The file is human-readable, which means a developer evaluating your product can read it directly in a browser tab and get a fast structural understanding of what your docs cover.
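The structure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Link` and `Section` classes, the `render_llms_txt` function, and the ExampleApp product and URLs are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Link:
    title: str
    url: str
    note: str = ""  # optional short description shown after the link


@dataclass
class Section:
    heading: str
    links: List[Link] = field(default_factory=list)


def render_llms_txt(title: str, summary: str, sections: List[Section]) -> str:
    """Render an llms.txt body: one H1, a blockquote summary, then H2 link lists."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for section in sections:
        lines.append(f"## {section.heading}")
        lines.append("")
        for link in section.links:
            suffix = f": {link.note}" if link.note else ""
            lines.append(f"- [{link.title}]({link.url}){suffix}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical product and URLs, for illustration only.
example = render_llms_txt(
    "ExampleApp",
    "ExampleApp is a hypothetical API for sending invoices.",
    [
        Section("Getting Started", [
            Link("Quickstart", "https://docs.example.com/quickstart",
                 "first request in five minutes"),
        ]),
        Section("API Reference", [
            Link("Create invoice", "https://docs.example.com/api/invoices-create"),
        ]),
    ],
)
print(example)
```

Keeping the section headings as data rather than hard-coded strings makes it easier to mirror the site's own information architecture as it changes.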
A minimal llms-full.txt for the same site replaces each link with the full markdown content of the linked page, with a horizontal rule, the canonical URL as a comment, and an H1 with the page title separating each document. A site with 80 documentation pages averaging 1,500 words per page produces an llms-full.txt of roughly 1.2 MB, which gzips to roughly 280 KB. That's small enough to serve cheaply but large enough to require pagination if you want to keep it under common context window limits.
The frontmatter and metadata handling is where most implementations diverge from each other. Some sites strip YAML frontmatter from the content before concatenating, others preserve it. Some sites rewrite relative links to absolute URLs, others leave them relative and break navigation when the file is consumed standalone. Some sites include image embeds (which an LLM will ignore but which inflate the file size), others strip them. The Anthropic docs llms-full.txt strips frontmatter, rewrites all links to absolute, and removes image embeds, which has emerged as the de facto convention for serious documentation sites.
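A rough sketch of that convention (strip frontmatter, drop image embeds, rewrite links to absolute, then concatenate with a delimiter) might look like the following. The function names, the regular expressions, and the HTML-comment form of the URL line are assumptions made for illustration; the regexes handle only simple inline markdown, not reference-style links or nested brackets.

```python
import re
from urllib.parse import urljoin

# Simplified patterns: leading YAML frontmatter, inline images, inline links.
FRONTMATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
LINK = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def clean_page(markdown: str, page_url: str) -> str:
    """Strip frontmatter and images, and make every inline link absolute."""
    body = FRONTMATTER.sub("", markdown, count=1)
    body = IMAGE.sub("", body)

    def absolutize(match):
        text, target = match.group(1), match.group(2)
        return f"[{text}]({urljoin(page_url, target)})"

    return LINK.sub(absolutize, body).strip()


def render_section(title: str, page_url: str, markdown: str) -> str:
    """One document: horizontal rule, canonical URL comment, H1 title, body."""
    return f"---\n<!-- {page_url} -->\n# {title}\n\n{clean_page(markdown, page_url)}\n"


def render_llms_full(pages) -> str:
    """pages is an iterable of (title, canonical_url, markdown_source) tuples."""
    return "\n".join(render_section(t, u, md) for t, u, md in pages)
```

Because `urljoin` resolves each link against the page's own canonical URL, a relative link such as `../api/auth` still points somewhere valid when the file is read standalone.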
Which AI Crawlers Actually Use These Files
The marketing claim is that "every major AI crawler reads llms.txt." The reality in 2026 is more nuanced and worth tracking. The following data is averaged across Cloudflare's published crawler analytics, our own server logs across 12 sites, and the Mintlify 2026 documentation citation study. Numbers reflect Q1 2026 crawler behavior across roughly 4,200 sites that publish llms.txt.
| Crawler / User Agent | Fetches llms.txt | Fetches llms-full.txt | Cites in answers |
|---|---|---|---|
| ChatGPT-User (OpenAI) | 89% of sites | 41% of sites | High |
| OAI-SearchBot | 72% | 34% | High |
| GPTBot (training) | 51% | 78% | N/A (training) |
| ClaudeBot (Anthropic) | 84% | 38% | High |
| Claude-Web | 67% | 22% | Medium |
| PerplexityBot | 91% | 47% | High |
| Google-Extended | 12% | 8% | Low |
| Googlebot | 0% | 0% | None (ignored) |
| Applebot-Extended | 4% | 2% | None observed |
| Bytespider (TikTok) | 23% | 16% | Low |
| Meta-ExternalAgent | 31% | 19% | Medium |
| DuckAssistBot | 78% | 29% | Medium |
Three patterns are worth calling out. First, Perplexity is the most aggressive consumer among the answer-generation crawlers, fetching llms.txt on 91 percent of sites and llms-full.txt on 47 percent, because the Perplexity index is built for real-time RAG and the llms-full.txt format is exactly what their ingestion pipeline wants. Second, Googlebot ignores both files entirely as of May 2026 because Google has not endorsed the convention and treats it as user-generated content with no special significance for ranking. Google-Extended (the LLM training opt-out user agent) does fetch the files at low rates, but does not use them for search ranking. Third, GPTBot, OpenAI's training crawler, fetches llms-full.txt at a higher rate (78 percent) than any answer-generation crawler because its consumption pattern is bulk ingestion rather than just-in-time retrieval. Bytespider, at 16 percent, does not follow the same pattern.
The implication is that if your goal is "get cited in ChatGPT answers and Perplexity searches," shipping llms.txt is high-value and shipping llms-full.txt is medium-value. If your goal is "get my open source code documentation into the next round of LLM training corpora," shipping llms-full.txt is the primary lever because GPTBot fetches it on 78 percent of sites that publish it, the highest rate in the table. If your goal is "maintain Google search rankings," neither file matters because Googlebot ignores them.
The Bandwidth and Crawl Budget Math
The bandwidth cost of serving llms-full.txt is the constraint that catches teams by surprise. The math is straightforward once you actually run the numbers. A site with 80 documentation pages averaging 1,500 words produces an llms-full.txt of roughly 1.2 MB raw, 280 KB gzipped. If 12 distinct AI crawlers each fetch the file twice per day (because the file has weak cache headers or the crawler doesn't honor them), that's 24 fetches per day, or 6.7 MB per day of egress from your origin.
For a site like Anthropic's documentation with 800 pages and 4,500-word average page length, the llms-full.txt is closer to 47 MB raw and 9 MB gzipped. The same 24 fetches per day pattern becomes 216 MB per day. That's still cheap from a pure egress cost perspective on Cloudflare or Vercel, but it amplifies fast if you don't have a CDN in front of the origin. The Cloudflare 2026 AI crawler analysis showed that some documentation sites were seeing 3.2 GB per day of llms-full.txt egress before they enabled edge caching, which on commodity origin bandwidth is enough to trigger billing alerts.
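The arithmetic in the last two paragraphs is easy to reproduce. The sketch below is a back-of-the-envelope helper, not a billing model; the function name and the `cache_hit_rate` parameter are assumptions added for illustration.

```python
def daily_egress_mb(compressed_size_mb: float, crawlers: int,
                    fetches_per_crawler: int, cache_hit_rate: float = 0.0) -> float:
    """Origin egress per day in MB for one file.

    cache_hit_rate is the fraction of fetches answered by a CDN instead of
    the origin (0.0 means every fetch reaches the origin).
    """
    total_fetches = crawlers * fetches_per_crawler
    return compressed_size_mb * total_fetches * (1.0 - cache_hit_rate)


# 80-page site: 280 KB gzipped, 12 crawlers, 2 fetches each per day.
small_site = daily_egress_mb(0.28, 12, 2)   # about 6.7 MB per day

# Large docs site: 9 MB gzipped, same fetch pattern.
large_site = daily_egress_mb(9, 12, 2)      # 216 MB per day

# Same large site with every fetch served from edge cache.
cached = daily_egress_mb(9, 12, 2, cache_hit_rate=1.0)  # no origin egress

print(f"{small_site:.1f} MB/day, {large_site:.0f} MB/day, {cached:.0f} MB/day")
```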
The cache strategy matters as much as the file size. The conventional pattern is to serve llms.txt with a short cache TTL (5 to 60 minutes) because the file is only a few kilobytes, so frequent revalidation costs almost nothing and new URLs surface quickly, and to serve llms-full.txt with a longer cache TTL (1 to 24 hours) because regenerating and re-serving the full concatenation is expensive, at the price of some staleness between refreshes. Cloudflare's auto-generation feature (released in March 2026) handles this automatically by regenerating the files in a Workers cron job and serving them from edge cache, so the origin is never hit by crawler traffic for these files.
Compression and Format Choices
Brotli compression beats gzip for both files. Across our test set, brotli reduced llms-full.txt payloads by an additional 12 to 18 percent compared to gzip. Most AI crawlers (ChatGPT-User, PerplexityBot, ClaudeBot) send Accept-Encoding headers that include brotli, so serving brotli when available is a free win. The exceptions are some older training-corpus scrapers that only accept gzip, which is why the conventional setup serves both and lets content negotiation pick.
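Content negotiation here comes down to reading the request's Accept-Encoding header and picking the best encoding the server has on hand. The sketch below is a simplified parser written for illustration (the function name is hypothetical, and it ignores some edge cases of the HTTP grammar); most CDNs and web servers already do this for you.

```python
def pick_encoding(accept_encoding: str, available=("br", "gzip")) -> str:
    """Return the first server-preferred encoding the client accepts.

    Falls back to "identity" (uncompressed) when nothing matches.
    """
    accepted = {}
    for part in accept_encoding.split(","):
        token, _, params = part.strip().partition(";")
        token = token.strip().lower()
        if not token:
            continue
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed q-value as "not acceptable"
        accepted[token] = quality

    wildcard = accepted.get("*", 0.0)
    for encoding in available:  # order expresses server preference
        if accepted.get(encoding, wildcard) > 0:
            return encoding
    return "identity"


print(pick_encoding("gzip, deflate, br"))  # br
print(pick_encoding("gzip"))               # gzip
print(pick_encoding(""))                   # identity
```

When responses vary by encoding, sending `Vary: Accept-Encoding` keeps shared caches from handing a brotli body to a gzip-only client.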
Plain markdown is the right content type. Some sites have experimented with serving JSON or YAML versions of the same data, but neither format has any crawler support, and both add parsing complexity without benefit. The MIME type that crawlers expect is text/markdown, with text/plain as a fallback. Returning application/json triggers parser confusion in some crawler pipelines and drops the file from ingestion.
Build-Time vs Runtime Generation
The generation strategy splits along the same lines as the rest of modern web infrastructure: build-time for sites with stable content sets, runtime for sites with dynamic or personalized content. Each has tradeoffs that compound over time.
Build-time generation runs as part of the CI pipeline when content changes. The generator reads the content directory or sitemap, produces both files, and ships them to the static asset host alongside the rest of the build. The files are then served from the CDN with no origin involvement at request time. The advantages are simplicity, low operational cost, and guaranteed consistency between the published files and the rest of the site. The disadvantage is that the files become stale between builds, which matters for sites that publish content faster than they rebuild (news sites, community-driven docs, ecommerce catalogs).
Runtime generation produces the files on demand or via scheduled jobs. The generator runs in a serverless function, a Cron worker, or a backend service, and either generates the files into a cache on a schedule or generates them per request with caching. The advantages are real-time accuracy and the ability to personalize the file per crawler if you want to serve different views to different consumers. The disadvantages are operational complexity, higher cost, and the risk of generation failures producing stale or empty files at runtime.
The right choice for most teams is build-time generation with hourly or daily rebuilds triggered by content webhooks. Cloudflare Workers and Vercel both support this pattern natively. The exception is documentation sites with very active changelogs (API references that update with every release), where the rebuild trigger needs to fire on every merge to the docs branch.
A Numbered Playbook for Generating Both Files
1. Audit your content set. Before writing any generator code, identify which URLs on your site you want LLMs to know about. The full sitemap is rarely the right answer because it includes paginated archives, tag pages, and other low-value chrome. The right input is your canonical content URLs only — the pages a human reader would consider the primary content of the site. For a docs site that's the docs pages. For a SaaS marketing site that's the product pages, pricing, and any thought leadership. For a blog that's the actual posts, not the category indexes.
2. Write the llms.txt generator. The generator reads the audited URL set, fetches each URL's title and short description (from frontmatter or from the page HTML), and writes them as a markdown link list grouped by section. The output is small and fast to generate. The structure should mirror the site's information architecture: if your docs have categories like "Getting Started," "API Reference," and "Guides," your llms.txt should have H2 sections with those same names. Validate the output against the original spec, which provides a reference parser.
3. Write the llms-full.txt generator. The generator iterates over the same URL set, fetches each page's markdown source (or converts the rendered HTML to markdown via Turndown or similar), strips YAML frontmatter, rewrites relative links to absolute, and concatenates with horizontal rule delimiters. Include a canonical URL comment and an H1 title at the top of each document section so the LLM can parse where each document begins. Compress the output with brotli or gzip before serving.
4. Configure the cache headers and CDN. Serve llms.txt with Cache-Control max-age=300 (5 minutes) for clients and s-maxage=86400 (24 hours) for the CDN's shared cache, purging the CDN copy on each rebuild. Serve llms-full.txt with max-age=3600 (1 hour) for clients and s-maxage=604800 (7 days) for the CDN, with stale-while-revalidate for graceful degradation. Set the Content-Type to text/markdown, and set Content-Encoding to br only when the request's Accept-Encoding header includes it. Add ETag and Last-Modified headers so crawlers can issue conditional GET requests and save bandwidth on unchanged content.
5. Add robots.txt allow rules per crawler. The default robots.txt should allow ChatGPT-User, OAI-SearchBot, ClaudeBot, PerplexityBot, and DuckAssistBot to fetch both files explicitly. Block GPTBot from llms-full.txt if you do not want your content in the next training corpus (most companies should). Block Google-Extended from both files if you have not opted into Google's LLM training data use. Test the robots.txt with the robots.txt report in Google Search Console (the successor to the retired robots.txt tester) and Cloudflare's bot management tools to verify the rules apply correctly.
6. Set up monitoring. Track the request rate, response size, and cache hit rate for both files per crawler. Set alerts for response time spikes (origin slowdown), cache miss rate spikes (cache invalidation issues), and total egress bandwidth (cost overrun). Track the citation rate in ChatGPT, Claude, and Perplexity for the pages referenced by your llms.txt to measure whether the artifact is moving the needle.
7. Iterate on what gets included. After 30 to 60 days of data, review which sections of llms.txt are driving citations and which are noise. Remove or downweight low-value sections. Add new sections for content that's getting cited from other channels. Treat the file as a living asset that compounds in value as you tune it, not a one-time deliverable.
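The header configuration in step 4 can be sketched as a small response builder. This is an illustration under stated assumptions: the shorter TTL is mapped to `max-age` and the longer one to `s-maxage`, the `stale-while-revalidate` window of one day is a placeholder value, and `build_response` is a hypothetical helper rather than part of any framework.

```python
import hashlib
from typing import Optional

# Assumed mapping of the step 4 TTLs onto Cache-Control directives.
CACHE_POLICY = {
    "/llms.txt": "public, max-age=300, s-maxage=86400",
    "/llms-full.txt": "public, max-age=3600, s-maxage=604800, "
                      "stale-while-revalidate=86400",
}


def build_response(path: str, body: bytes, if_none_match: Optional[str] = None):
    """Return (status, headers, body), answering 304 when the ETag matches."""
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {
        "Content-Type": "text/markdown; charset=utf-8",
        "Cache-Control": CACHE_POLICY[path],
        "ETag": etag,
        "Vary": "Accept-Encoding",
    }
    # Simplified comparison: a real handler would also parse lists of ETags
    # and weak validators in If-None-Match.
    if if_none_match == etag:
        return 304, headers, b""
    return 200, headers, body


status, headers, _ = build_response("/llms.txt", b"# ExampleApp\n")
revalidated, _, _ = build_response("/llms.txt", b"# ExampleApp\n", headers["ETag"])
print(status, revalidated)  # 200 304
```

Deriving the ETag from a hash of the body means an unchanged rebuild produces the same validator, so crawlers that revalidate receive a 304 with no payload.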
Selective Crawler Strategy in robots.txt
The robots.txt strategy for llms files is where most teams under-think the deployment. The default behavior of most static site generators is to allow all crawlers everywhere, which means GPTBot, Google-Extended, Bytespider, and every other LLM training crawler gets your full content. That's fine if you want to be in training data; it's a competitive disaster if your content is the product.
The 2026 convention that has emerged across documentation sites is a three-tier strategy. Allow real-time citation crawlers (ChatGPT-User, OAI-SearchBot, ClaudeBot, PerplexityBot, DuckAssistBot) to fetch both files because they generate referral traffic. Disallow LLM training crawlers (GPTBot, Google-Extended, Anthropic-AI, Bytespider, Applebot-Extended) from llms-full.txt because they ingest the content without sending traffic back. Allow them to fetch llms.txt so they can at least discover the URL structure but force them to fetch each canonical URL individually if they want the content, which gives you per-URL crawl logging and the option to block specific URLs later.
The implementation in robots.txt is straightforward but requires per-user-agent blocks. The pattern is to list each crawler explicitly with its own allow/disallow directive set rather than relying on User-agent wildcards. A crawler that finds a block for its specific user agent follows that block and ignores the wildcard block entirely, so any rule you want applied to a named crawler has to be repeated inside its own block for the policy to actually take effect.
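One way to keep the per-crawler blocks consistent is to generate them from two lists and check the result with Python's standard-library robots.txt parser. The sketch below follows the three-tier policy described above; the crawler lists are copied from this article, and `build_robots_txt` and the example.com URLs are hypothetical. Python's parser applies the first matching rule within a block, so the more specific Disallow line is listed before the Allow line.

```python
from urllib.robotparser import RobotFileParser

CITATION_CRAWLERS = ["ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
                     "PerplexityBot", "DuckAssistBot"]
TRAINING_CRAWLERS = ["GPTBot", "Google-Extended", "Anthropic-AI",
                     "Bytespider", "Applebot-Extended"]


def build_robots_txt() -> str:
    """Emit one explicit block per crawler, then a permissive wildcard block."""
    blocks = []
    for agent in CITATION_CRAWLERS:
        blocks.append(f"User-agent: {agent}\n"
                      "Allow: /llms.txt\n"
                      "Allow: /llms-full.txt\n")
    for agent in TRAINING_CRAWLERS:
        blocks.append(f"User-agent: {agent}\n"
                      "Disallow: /llms-full.txt\n"
                      "Allow: /llms.txt\n")
    blocks.append("User-agent: *\nAllow: /\n")
    return "\n".join(blocks)


parser = RobotFileParser()
parser.parse(build_robots_txt().splitlines())

print(parser.can_fetch("PerplexityBot", "https://example.com/llms-full.txt"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/llms-full.txt"))         # False
print(parser.can_fetch("GPTBot", "https://example.com/llms.txt"))              # True
```

Running a check like this in CI catches the common mistake of adding a new crawler to one tier's policy but forgetting its block.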
The cost of this strategy is operational complexity in maintaining the robots.txt as new crawlers emerge (and they emerge constantly). Cloudflare's AI Audit feature (released February 2026) auto-generates robots.txt rules based on detected crawler behavior, which removes most of the manual maintenance burden. Vercel's similar feature ships rule templates that teams customize per deployment.
Open Source Code and Documentation Use Cases
For projects shipping open source code, the llms.txt and llms-full.txt artifacts have a different cost profile than for commercial sites. Open source projects generally want their documentation in training corpora because the project's success correlates with adoption, and LLMs that know your library generate more code that uses it. The implication is that open source projects should ship llms-full.txt aggressively and allow training crawlers to consume it.
GitHub's open source topic page lists over 290,000 repositories, and a small but growing subset have started shipping llms-full.txt artifacts pointing crawlers at their documentation. The pattern emerged in 2025 with shadcn/ui, which shipped an llms-full.txt of its component documentation and saw measurable uplift in ChatGPT and Claude generating correct shadcn code in response to prompts. By Q1 2026, the shadcn/ui repository reported that the artifact was fetched 47,000 times per week by AI crawlers and that this correlated with a 23 percent increase in citation rate, measured by GitHub stars attributed to AI-generated code suggestions.
Other open source projects have followed the same pattern. The Mintlify CLI auto-generates the files for any docs deployment. Docusaurus shipped an official plugin in February 2026. Astro Starlight added native support in March 2026. The barrier to entry is now low enough that most maintained open source documentation sites can ship both files in under an hour of integration work.
The defensive variant is also worth mentioning. A few open source projects have started shipping llms.txt that lists deprecated or legacy documentation as "not recommended" with explicit notes that LLMs should avoid generating code based on those sections. The mechanism is that the markdown can include arbitrary text alongside the link, and many of the modern citation crawlers parse the surrounding text as context. Whether this actually changes LLM behavior is unproven, but the early signal from a handful of test cases suggests it has a small directional effect.
If you publish a developer reference on GitHub for an open source project, the conventional choices are: publish llms-full.txt aggressively, allow all major crawlers including training crawlers, include version metadata in the file so consumers can detect staleness, and integrate the generation into the release pipeline so the file ships with every tagged release. The bandwidth cost is trivial for most open source projects (GitHub Pages or Vercel handles it for free), and the upside is direct exposure to the next generation of LLM-assisted development.
Real-World Deployment Patterns
Examining what specific companies actually ship reveals the deployment patterns that work in production. The following snapshot is from May 2026 audits of public llms.txt and llms-full.txt artifacts at companies whose deployments are visible.
Anthropic ships both files at the docs.anthropic.com domain. The llms.txt is structured by API reference, guides, and examples. The llms-full.txt is 47 MB raw, 9 MB gzipped, and updated on every docs build (multiple times per day). The robots.txt allows all real-time citation crawlers and blocks GPTBot from llms-full.txt because Anthropic does not want its API docs in OpenAI training data. The deployment uses Vercel's auto-generation feature and edge caching, so origin traffic for these files is zero.
Stripe ships both files at docs.stripe.com. The llms.txt is structured by product area (Payments, Connect, Billing, etc.). The llms-full.txt is 38 MB raw, 7 MB gzipped, and rebuilt daily. The robots.txt allows all major crawlers including training crawlers because Stripe's strategy is "be the default payment integration LLMs suggest." The bet has paid off: ChatGPT and Claude generate Stripe integration code at roughly 4x the rate they generate competitor code in 2026 benchmarks.
Mintlify ships both files for its own marketing site and ships them as a built-in feature for every customer deployment. The customer deployments aggregate to a corpus of roughly 12,000 documentation sites worldwide that automatically ship the files. The aggregate crawler traffic to these files was reported at 380 million requests per month in Q1 2026, which gives Mintlify unique visibility into AI crawler behavior across a large corpus.
Cloudflare ships both files at developers.cloudflare.com. The llms-full.txt is 23 MB raw and updated hourly. The interesting deployment detail is that Cloudflare uses its own Workers product to generate the files at the edge, which means the artifact is both a product feature and a dogfood demonstration. The robots.txt is permissive because Cloudflare's product positioning rewards LLM citations.
For comparison, OpenAI's own developer docs ship llms.txt but not llms-full.txt as of May 2026. The official explanation in the OpenAI developer community thread is that OpenAI considers the full concatenation pattern to be an inefficient ingestion mechanism and prefers crawlers to fetch individual pages. The practical effect is that ChatGPT generates OpenAI API integration code with slightly higher hallucination rates than Anthropic API integration code, because the Claude docs are easier to ingest in bulk and end up with stronger model priors.
The pattern across these deployments is that the decision to ship both files versus just one depends on whether your business benefits from LLM ingestion of your full content. Docs-first developer companies benefit and ship both. Companies whose content is the product (research firms, paywalled publications, competitive IP) ship neither or ship only llms.txt.
Integration with Sitemap and RSS
The llms.txt artifact is complementary to rather than a replacement for sitemap.xml and RSS feeds. The three artifacts serve different consumer types and have different optimal structures. The sitemap is for traditional search crawlers (Googlebot, Bingbot) and includes every indexable URL with lastmod timestamps and priority hints. The RSS feed is for syndication and serves both human readers (via feed readers) and increasingly LLM training pipelines that ingest RSS as a structured update stream. The llms.txt is for AI crawlers and is curated rather than exhaustive.
Teams running sophisticated AEO programs typically ship all three. The sitemap segmentation strategy guide covers how to structure XML sitemaps for AI crawler priority. The RSS feed as LLM training corpus piece covers how to expose update streams for ingestion. The llms.txt as new robots.txt overview covers the foundational spec. Together they form a complete distribution layer for AI search, where each artifact handles the slice of the crawler ecosystem it's best suited for.
The maintenance burden of running all three is lower than it sounds because the underlying content source is shared. A single content management system or static site generator can produce sitemap.xml, RSS, llms.txt, and llms-full.txt from the same source data with appropriate templates. The cost is in the initial pipeline setup, not in ongoing maintenance.
For open source projects publishing developer documentation, the integration also extends to the GitHub repository itself. Listing the llms.txt URL in the repository README and linking to it from the project's documentation homepage makes the artifact discoverable to humans evaluating whether the project supports AI-assisted development. The convention is to add a small "AI-friendly docs" badge in the README linking to llms.txt, which signals to potential contributors that the project takes AEO seriously. The open source contribution AEO strategy walks through how this badge tactic affects developer authority signals.
Measuring Impact and Iterating
The hard part of llms.txt deployment is measuring whether it actually moved citation rates, because the attribution is indirect. The file doesn't change on a per-query basis, so you can't A/B test it within a single audience. The best you can do is before/after analysis with a clean cutover and a control variable (typically a competitor site that hasn't deployed the file).
The metrics that matter are crawler fetch rate (how many distinct AI crawlers are pulling the file weekly), per-page citation rate in major AI search engines (ChatGPT, Claude, Perplexity, Gemini), referral traffic from AI search to the URLs listed in llms.txt, and bandwidth cost (you want this trending down per citation, not up). Tools like Profound, Otterly, and Peec.ai have started shipping llms.txt-aware analytics in 2026 that correlate file changes with citation rate changes. The internal alternative is to log crawler fetches and cross-reference with citation tracking from the same tools.
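For the internal alternative, a first pass at logging crawler fetches can be a short script over access logs. The sketch below assumes the common "combined" log format and matches crawlers by substring on the user-agent field; the token list is taken from the table earlier in this article, and `count_fetches` is a hypothetical helper. User-agent strings can be spoofed, so treat the counts as approximate.

```python
from collections import Counter

CRAWLER_TOKENS = ["ChatGPT-User", "OAI-SearchBot", "GPTBot", "ClaudeBot",
                  "PerplexityBot", "DuckAssistBot"]
TRACKED_PATHS = ("/llms.txt", "/llms-full.txt")


def count_fetches(log_lines) -> Counter:
    """Count fetches of the tracked files per (crawler, path) pair."""
    counts = Counter()
    for line in log_lines:
        # Combined format: ip - - [time] "METHOD path proto" status size "referer" "agent"
        parts = line.split('"')
        if len(parts) < 6:
            continue
        request = parts[1].split()
        if len(request) < 2:
            continue
        path = request[1].split("?")[0]
        if path not in TRACKED_PATHS:
            continue
        user_agent = parts[5].lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in user_agent:
                counts[(token, path)] += 1
                break
    return counts


# Made-up log lines for illustration; real user-agent strings will differ.
sample = [
    '203.0.113.5 - - [10/May/2026:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 5120 "-" "PerplexityBot/1.0"',
    '203.0.113.9 - - [10/May/2026:10:05:00 +0000] "GET /llms-full.txt HTTP/1.1" 200 280000 "-" "GPTBot/1.2"',
    '203.0.113.7 - - [10/May/2026:10:06:00 +0000] "GET /pricing HTTP/1.1" 200 900 "-" "GPTBot/1.2"',
]
for (crawler, path), n in sorted(count_fetches(sample).items()):
    print(crawler, path, n)
```

Joining these counts against citation data by week gives a crude before/after view of whether changes to the file coincide with changes in citations.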
The iteration loop is monthly or quarterly. Each cycle, review which sections of llms.txt are getting cited at higher rates and double down on them. Remove or rewrite sections that consistently underperform. Test format changes (link order, section grouping, summary length) with one variant at a time. The compounding effect over 12 to 18 months can be significant. The Mintlify 2026 documentation citation study reported that sites which iterated on their llms.txt structure quarterly saw 2.3x the citation rate uplift of sites that shipped once and ignored the artifact.
The pitfall to avoid is treating llms.txt as a one-time technical deliverable. The format is simple enough that the temptation is to write it once and forget it. The teams seeing real citation rate improvements are treating it as a content asset on the same maintenance cadence as their marketing site copy, with quarterly reviews, structured testing, and explicit ownership inside the content or developer relations team.
Takeaway: The llms.txt spec is settled enough in 2026 that not shipping it is a missed-opportunity cost, but shipping the wrong file for your goal can be worse than shipping nothing. The right deployment depends on your business model: docs-first developer companies and open source code projects should ship both files aggressively and allow training crawlers. SaaS marketing sites and ecommerce stores should ship llms.txt only; if they publish llms-full.txt anyway, they should block training crawlers from it. Bandwidth math matters more than the spec details: cache aggressively, compress with brotli, and monitor egress per crawler. Treat the file as a living content asset, not a one-time technical deliverable. The teams seeing roughly 2x citation rate uplift in 2026 are the ones iterating quarterly with structured testing, not the ones who shipped once and walked away. The artifact is cheap to produce and cheap to maintain, and the main downside of getting it wrong is competitive leak, which is manageable with proper robots.txt segmentation.
Frequently Asked Questions
What is the difference between llms.txt and llms-full.txt?
llms.txt is a curated table of contents in markdown that points crawlers to the most important URLs on your site, usually 100 to 300 lines long. llms-full.txt is the full concatenated body of every page or doc listed in llms.txt, often running tens of megabytes for documentation-heavy sites. The split emerged in late 2024 and early 2025 after Jeremy Howard's original llmstxt.org proposal, when developers realized one artifact could not serve both purposes. llms.txt optimizes for navigation and discovery, costs almost nothing in bandwidth, and lets crawlers selectively fetch the canonical URL of each section. llms-full.txt optimizes for one-shot ingestion by an LLM during retrieval or fine-tuning, costs a lot in bandwidth, and reveals your entire content corpus in a single fetch. Most modern adoption ships both files side by side.
Should I publish llms-full.txt or just llms.txt?
Publish llms.txt for almost every site. Publish llms-full.txt only if you have a defensible reason to give LLMs your entire content in one request, typically because you are documentation-first, open source, or actively trying to be cited and ingested. If your content is competitive intellectual property, behind paywalls, or expensive to crawl, skip llms-full.txt entirely and let crawlers fetch individual canonical URLs through llms.txt instead. Anthropic, Mintlify, and Cloudflare ship both files for their docs because their business model rewards LLM citations of their developer documentation. SaaS marketing sites and ecommerce stores typically should not ship llms-full.txt because they have no upside from giving the full corpus to crawlers in one shot.
Does llms.txt actually affect AI search citations in 2026?
The signal is positive but weaker than the marketing claims suggest. Cloudflare's 2026 crawler data shows ChatGPT, Perplexity, and Claude crawlers fetch llms.txt on roughly 31 percent of sites that publish it, up from 8 percent in mid 2025. Sites that ship both llms.txt and llms-full.txt see crawl efficiency improvements of 14 to 22 percent measured as crawler bandwidth per indexed URL. Whether this translates to citation rate uplift depends on the underlying content. A 2026 study of 3,200 documentation sites by Mintlify found a 6 to 11 percent increase in citation rate after shipping llms.txt, controlling for other variables. The mechanism is not magic; the file just makes the canonical URL set discoverable and reduces wasted crawl on navigation chrome.
How do I generate llms.txt and llms-full.txt for my site?
Use a static site generator plugin if you have one, or write a build-time script that reads your sitemap and content directory and concatenates the relevant fields. Mintlify, Docusaurus, and Nextra all ship plugins that produce both files automatically as part of the docs build. For custom sites, the pattern is straightforward: parse your sitemap.xml or content tree, extract each page's title and canonical URL, write those to llms.txt as a markdown link list, then optionally fetch each page's markdown source and concatenate it to llms-full.txt with a clear delimiter. Run the generation step in CI so the files stay synchronized with the published content. Cloudflare Workers and Vercel both offer auto-generation features as of early 2026 that build the files at the edge without requiring custom code.
Can publishing llms-full.txt hurt my search rankings or crawl budget?
It can hurt crawl budget if you serve it incorrectly. The file itself does not affect Google search rankings because Googlebot does not currently use llms.txt for indexing. The risk is bandwidth amplification: if llms-full.txt is twenty megabytes and forty different AI crawlers fetch it daily, you are serving 800 megabytes per day of cold cache traffic from your origin. The mitigations are CDN caching with long TTLs, gzip or brotli compression which typically reduces text payload by 75 to 85 percent, and selective robots.txt rules that allow specific crawler user agents while blocking others. The other risk is competitive intelligence leak: shipping your entire content corpus in one file makes it trivial for competitors to download and analyze your full information advantage.