SignalFeed

llms.txt Is the New robots.txt: What AI Crawlers Actually Do With It

The llms.txt proposal exploded across hacker forums and SEO Twitter in 2025. By mid-2026, every serious publisher has one. The catch: most of them are configured wrong, and the major AI labs are not reading them the way teams assume.


In September 2024, Jeremy Howard, the co-founder of fast.ai and Answer.AI, proposed a one-page specification for a file called llms.txt. The pitch was simple: publishers should be able to expose a curated, plain-text, LLM-friendly view of their site so that AI systems could quickly find the pages most worth citing.

The proposal landed at exactly the right time. Through 2024 and into 2025, publishers were watching organic traffic shift toward AI answers and feeling powerless to influence what those answers said. llms.txt felt like leverage. By mid-2025, Anthropic's documentation, Stripe, Cloudflare, Vercel, and hundreds of other developer-focused sites had shipped llms.txt files. SEO tools added llms.txt audits. Content marketers started writing how-to guides.

By May 2026, the situation has matured into something more useful and more confused. The format is everywhere, but the practical questions are still open: which AI systems actually read these files, what do they do with them, and how should publishers configure them to get anything back?

This piece walks through the answers as they stand today. The honest version, not the marketing version.

What llms.txt Actually Is

llms.txt is a plain-text Markdown file that lives at the root of a domain, at the path /llms.txt. Inside, the file uses Markdown headings and link lists to point AI systems at the pages the publisher considers most important.

A minimal valid file looks like this.

```
# Example Company

> Example Company builds developer tools for AI applications.

## Documentation

- Getting started: Five-minute setup.
- API reference: Complete API documentation.
```

The format is intentionally simple. There are no required fields beyond a top-level title, no XML schema to validate against, no permissioning syntax. The goal is not to instruct AI systems on what to do; it is to give them a curated path through the site.
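To make the structure concrete, here is a minimal sketch of how a consumer might read such a file. It assumes the common layout of one top-level title, one blockquote summary, and second-level headings that group Markdown links. The `LlmsTxt` class, the `parse_llms_txt` function, and the example.com URLs are hypothetical and exist only for illustration; real consumers may parse the file differently.

```python
import re
from dataclasses import dataclass, field

# Matches "- [Title](url): optional description" list items.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<desc>.*))?$"
)


@dataclass
class LlmsTxt:
    title: str = ""
    summary: str = ""
    sections: dict[str, list[dict[str, str]]] = field(default_factory=dict)


def parse_llms_txt(text: str) -> LlmsTxt:
    """Collect the H1 title, the blockquote summary, and links under each H2."""
    doc = LlmsTxt()
    current = None  # name of the H2 section being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and not doc.title:
            doc.title = line[2:].strip()
        elif line.startswith("> ") and not doc.summary:
            doc.summary = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc.sections[current] = []
        elif current is not None:
            m = LINK_RE.match(line)
            if m:
                doc.sections[current].append(
                    {"title": m["title"], "url": m["url"], "desc": m["desc"] or ""}
                )
    return doc


# Hypothetical file content; the URLs are placeholders.
SAMPLE = """\
# Example Company

> Example Company builds developer tools for AI applications.

## Documentation

- [Getting started](https://example.com/docs/start): Five-minute setup.
- [API reference](https://example.com/docs/api): Complete API documentation.
"""

parsed = parse_llms_txt(SAMPLE)
```

The parser ignores anything it does not recognize, which matches the spirit of a format with no schema to validate against.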

A second file, /llms-full.txt, is often published alongside the main file. This second file concatenates the full cleaned content of the linked pages so that an AI with a long context window can ingest the entire authoritative corpus in a single fetch. Larger publishers split this into multiple files by section.
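A rough sketch of the concatenation step, assuming the pages have already been cleaned to Markdown by some earlier stage. The `build_llms_full` function, the `Source:` line, and the `---` separator are illustrative choices, not part of the proposal.

```python
def build_llms_full(pages: list[tuple[str, str, str]]) -> str:
    """Join (title, url, cleaned_markdown) triples into one document."""
    parts = []
    for title, url, body in pages:
        # Each page gets its own heading and a pointer back to the source URL.
        parts.append(f"# {title}\n\nSource: {url}\n\n{body.strip()}\n")
    return "\n---\n\n".join(parts)


# Hypothetical pages; in practice these would come from the docs build.
pages = [
    ("Getting started", "https://example.com/docs/start", "Install the CLI, then run init.\n"),
    ("API reference", "https://example.com/docs/api", "All endpoints accept JSON.\n\n"),
]
full_text = build_llms_full(pages)
```

Writing `full_text` to `/llms-full.txt` at build time keeps the corpus in step with the published pages.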

How It Differs from robots.txt

The most common misconception is that llms.txt is a successor to robots.txt. It is not.

robots.txt is a permissioning file standardized by the IETF in RFC 9309. It tells compliant crawlers which paths they may fetch and which user agents are blocked. When you block GPTBot or CCBot in robots.txt, the major operators of those crawlers respect the directive.

llms.txt is a curation file. It does not block or allow anything. It signals priority and structure. A publisher can — and often should — use both files. robots.txt handles permissioning ("do not crawl /admin"). llms.txt handles curation ("when you summarize this site, here are the pages worth quoting").

Conflating the two leads to the most common configuration mistake: publishers using llms.txt to try to block AI summarization, which it cannot do, while leaving GPTBot allowed in robots.txt.
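The division of labor is easy to check with Python's standard `urllib.robotparser` module. The sketch below uses a made-up robots.txt and a hypothetical `ExampleBot` agent; the point is that the allow-or-block decision comes entirely from robots.txt, and nothing in llms.txt enters into it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot is blocked everywhere, everyone else only from /admin.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


gptbot_ok = allowed(ROBOTS_TXT, "GPTBot", "https://example.com/guides/intro")
other_ok = allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/guides/intro")
admin_ok = allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/admin")
```

Running the same check against a site's live robots.txt is a quick way to confirm whether a crawler the team believes is blocked really is.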

| File | Purpose | Standard | Who enforces |
| --- | --- | --- | --- |
| robots.txt | Permission control for crawlers | RFC 9309 (2022); de facto convention since 1994 | Compliant crawlers |
| sitemap.xml | URL discovery for indexing | sitemaps.org spec | Search engines |
| llms.txt | Curation hints for AI systems | Community proposal (2024) | AI crawlers (voluntary) |
| llms-full.txt | Full-text corpus for LLM ingestion | Community proposal | AI crawlers (voluntary) |

Which AI Crawlers Actually Read It

This is the question every operator wants answered honestly.

Anthropic. Anthropic has publicly noted that Claude's fetchers consider llms.txt during web retrieval. Server logs from publishers that ship the file show Claude-User and other Anthropic agents requesting /llms.txt during retrieval-grounded queries. The file is not a direct training input, but it appears to inform citation decisions when Claude browses the web in response to a query.

Perplexity. Perplexity has indicated support and crawls the file. Independent observation of Perplexity's citation behavior on sites with strong llms.txt files suggests it shifts which pages get cited on the source's domain, though the effect is modest and inconsistent across queries.

OpenAI. OpenAI's fetchers — GPTBot, OAI-SearchBot, ChatGPT-User — do request /llms.txt from compliant sites. OpenAI has not publicly confirmed how the file is used. Independent analyses of ChatGPT browsing behavior find that pages already prominent in OpenAI's training data and in Bing's index dominate citations, with llms.txt as at most a tiebreaker.

Google. Google has been the clearest: AI Overviews and AI Mode use the same retrieval foundation as Google Search. There is no separate AI index, and llms.txt is not part of the documented requirements for inclusion. See Signal's analysis of Google's AI Overview ranking signals for the broader picture.

Smaller AI products. Vertical AI tools, code assistants, and research-focused agents are the most enthusiastic consumers of llms.txt. Many use it as a first-class signal because they lack the index scale of the major labs.

The realistic conclusion: llms.txt is read by some systems some of the time. Treating it as a guaranteed visibility lever sets the wrong expectation. Treating it as a low-cost hint that improves the odds of being cited correctly is closer to the truth.
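Publishers can check this on their own traffic rather than take anyone's word for it. The sketch below counts requests for /llms.txt and /llms-full.txt per AI agent. It assumes access logs in the common "combined" format; the log lines and user-agent strings are synthetic, and the agent list is a starting point to extend, not an authoritative registry.

```python
import re
from collections import Counter

# User-agent substrings to look for; adjust to the crawlers you care about.
AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "PerplexityBot",
]

# Combined log format: "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_llms_txt_hits(lines):
    """Count llms.txt / llms-full.txt requests per known AI agent."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if not m or m["path"].split("?")[0] not in ("/llms.txt", "/llms-full.txt"):
            continue
        for agent in AI_AGENTS:
            if agent in m["ua"]:
                hits[agent] += 1
                break
    return hits


# Synthetic log lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [12/May/2026:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [12/May/2026:10:01:00 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Claude-User/1.0)"',
    '203.0.113.9 - - [12/May/2026:10:02:00 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]
hits = count_llms_txt_hits(SAMPLE_LOG)
```

A request in the logs shows only that the file was fetched. It says nothing about how, or whether, the content influenced an answer.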

The Configuration Most Publishers Get Wrong

Auditing roughly 200 llms.txt files in May 2026 reveals a small number of recurring mistakes.

The first is dumping the sitemap. Many publishers generate an llms.txt that simply lists every URL on the site. This defeats the purpose of curation. The file should highlight a manageable number of high-value entries, not exhaustively enumerate every page.

The second is missing descriptions. The Markdown link format allows a short description after each link. Many files omit them. Without a description, an AI system has no signal about why the page matters or what it covers, which reduces the file's curation value to roughly zero.

The third is stale content. Files that link to deprecated documentation, expired blog posts, or renamed product pages signal that the publisher does not maintain the file. AI systems that surface stale content based on a stale llms.txt produce poor user experiences, which in turn reduces trust in the source.

The fourth is omitting the freshness section. The best-performing files include a clearly labeled "Recent" or "Updated" section that lists the most current authoritative pages. AI systems trying to answer time-sensitive queries can use this section to prefer fresh sources over older ones.

The fifth, and most consequential, is treating llms.txt as a substitute for everything else. A great llms.txt file on a site with broken sitemaps, missing structured data, blocked crawlers, or thin content does not produce visibility. llms.txt is one component of an AI-friendly content stack, not a replacement for it.
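Three of these mistakes can be caught mechanically. The following is a rough lint sketch, not an established tool: `lint_llms_txt` is a hypothetical helper, and the 60-entry cap is an assumed threshold to tune per site. Stale links are not covered here because checking them requires fetching each URL.

```python
import re

# A list item that starts with a Markdown link; "rest" holds whatever follows it.
LINK_RE = re.compile(r"^-\s*\[[^\]]+\]\([^)]+\)(?P<rest>.*)$")


def lint_llms_txt(text: str, max_entries: int = 60) -> list[str]:
    """Flag sitemap dumps, missing descriptions, and a missing freshness section."""
    problems = []
    lines = [ln.strip() for ln in text.splitlines()]
    matches = [LINK_RE.match(ln) for ln in lines]
    links = [m for m in matches if m]

    if len(links) > max_entries:
        problems.append(f"too many entries ({len(links)}); looks like a sitemap dump")

    missing = sum(1 for m in links if not m["rest"].lstrip(": ").strip())
    if missing:
        problems.append(f"{missing} link(s) without a description")

    headings = [ln[3:].lower() for ln in lines if ln.startswith("## ")]
    if not any(word in h for h in headings for word in ("recent", "updated")):
        problems.append('no "Recent" or "Updated" section')
    return problems


# Hypothetical file with one undescribed link and no freshness section.
SAMPLE = """\
# Example Company

## Documentation

- [Getting started](https://example.com/docs/start): Five-minute setup.
- [API reference](https://example.com/docs/api)
"""
problems = lint_llms_txt(SAMPLE)
```

The fifth mistake has no lint rule. No check on the file itself can tell whether the rest of the site is in order.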

The Five-Step Configuration Playbook

For teams shipping or auditing llms.txt today, the following sequence covers the high-leverage work.

1. Confirm crawl access first. Before optimizing curation, audit robots.txt and CDN rules to ensure the AI crawlers you want to reach are allowed. Many teams discover during this step that they have inadvertently blocked GPTBot, ClaudeBot, or PerplexityBot at the CDN layer. Fix this before anything else.

2. Map the canonical pages. Identify the 20 to 60 pages on the site that you most want AI systems to cite. These are typically your authoritative product pages, your definitive guides, your pricing, your documentation, your case studies, and a small number of recent thought leadership pieces. Avoid soft marketing pages and stale content.

3. Write the file by hand. Auto-generated llms.txt files almost always fail the description and curation tests. The file is short enough to maintain manually. A human-edited file with intentional descriptions consistently outperforms an auto-generated one.

4. Publish llms-full.txt for documentation-heavy sites. If your domain has a documentation corpus that an AI system would benefit from ingesting in one pass, publish the full cleaned content at /llms-full.txt. For sites without a documentation core, this file is optional and often skipped.

5. Validate, deploy, and re-audit quarterly. Use a Markdown linter to confirm valid syntax. Verify the file is served with content-type text/plain or text/markdown and returns a 200 response. Schedule a quarterly audit to refresh links, remove deprecated pages, and update the Recent section.
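The serving check in step 5 is simple enough to script. This sketch separates the rule (`check_response`) from the network call (`check_url`) so the rule can be tested offline. Both function names are hypothetical, and `check_url` needs network access to run.

```python
import urllib.error
import urllib.request


def check_response(status: int, content_type: str) -> list[str]:
    """Apply the step 5 checks: HTTP 200 and a text/plain or text/markdown type."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in ("text/plain", "text/markdown"):
        problems.append(f"unexpected content type: {media_type or 'missing'}")
    return problems


def check_url(url: str) -> list[str]:
    """Fetch the file and run check_response on the result (requires network)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return check_response(resp.status, resp.headers.get("Content-Type", ""))
    except urllib.error.HTTPError as err:
        return check_response(err.code, err.headers.get("Content-Type", ""))
```

Calling `check_url("https://example.com/llms.txt")` with your own domain from a scheduled job gives the quarterly audit an automatic first pass.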

The whole process for a mid-sized site takes a few hours. Done poorly, it is worse than nothing. Done well, it is a small but real signal in the AI visibility stack.

What llms.txt Cannot Do

The expectation gap is large enough that it is worth stating bluntly what llms.txt does not do.

It does not block AI training. If you do not want your content used to train models, you need provider-specific opt-outs in robots.txt and, for some providers, account-level controls. llms.txt is not a permissioning file.
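For the robots.txt half of that opt-out, a short sketch of generating the stanzas and checking them with the standard library. The token list here is only an example built from crawlers named in this piece; which tokens a provider honors for training, and whether account-level controls are also needed, has to be confirmed against that provider's own documentation. `training_opt_out` and `ExampleBot` are hypothetical names.

```python
from urllib.robotparser import RobotFileParser


def training_opt_out(user_agents: list[str]) -> str:
    """Render one disallow-everything stanza per crawler token."""
    return "\n".join(f"User-agent: {ua}\nDisallow: /\n" for ua in user_agents)


# Example tokens only; verify the current list with each provider.
rules = training_opt_out(["GPTBot", "CCBot", "ClaudeBot"])

# Sanity-check the generated rules before deploying them.
parser = RobotFileParser()
parser.parse(rules.splitlines())
blocked = not parser.can_fetch("CCBot", "https://example.com/blog/post")
unlisted_allowed = parser.can_fetch("ExampleBot", "https://example.com/blog/post")
```

These rules only bind crawlers that choose to honor robots.txt, so they express a preference rather than enforce one.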

It does not guarantee citation. AI systems decide citations based on many signals: ranking, freshness, trust, authority, query relevance. llms.txt is at best one input among many.

It does not improve Google AI Overview visibility through any documented mechanism. Google's stated guidance is that AI Overviews use Search's existing ranking foundation. See Signal's piece on AEO, GEO, and SEO terminology for how the labels interact.

It does not fix thin content. If the underlying pages do not deserve to be cited, a curated index of those pages will not change AI citation behavior in any meaningful way.

It does not replace structured data. Schema, Open Graph, and standard meta tags do work that llms.txt does not address. Both layers matter.

The Reasonable Investment Level

Given the low cost and the modest, uncertain payoff, the right level of investment is small but real.

A mid-sized SaaS site should ship a hand-edited llms.txt file in a few hours, audit it quarterly, and integrate the audit into the existing content operations rhythm. Total annualized effort is a handful of hours.

A documentation-heavy site should also ship llms-full.txt and integrate generation into the docs build. Total effort is one to two engineering-days up front, then a few hours per quarter.

A site with no documentation core, thin content, and weak structured data should fix those problems first. Adding llms.txt to a site that does not deserve to be cited is theater.

The most common misallocation is teams spending a week building elaborate llms.txt tooling while their structured data is broken, their sitemaps are stale, and their best content lacks clear authorship signals. That is the wrong sequencing. Foundational SEO, structured data, and content quality are higher-leverage. llms.txt belongs near the end of the checklist, not the beginning.

What Comes Next

Three developments are worth watching through 2026.

The first is whether OpenAI or Google publicly commits to making llms.txt a documented signal. If either does, the format's practical importance jumps. If neither does, llms.txt remains a useful-but-modest hint.

The second is whether the spec itself evolves. The current proposal is minimal. Extensions for content licensing, citation preferences, and machine-readable freshness signals are all under discussion in the broader community. A v2 of the spec is plausible by end of 2026.

The third is how AI-first publishers integrate llms.txt into their broader content operations. The teams treating it as a serious editorial artifact — with named owners, quarterly reviews, and connection to analytics — are setting the pattern for what mature AI-content operations look like. The teams treating it as a marketing checkbox will have nothing to show for the effort. See Signal's analysis on trust signals for AI search for how llms.txt fits into the broader trust stack.

Takeaway: llms.txt is a real and useful primitive, but it is not the silver bullet some early adopters hoped for. The file is voluntary, the major AI labs consume it inconsistently, and a poorly maintained file is worse than no file at all. The right approach is to publish a hand-edited file, keep it short, refresh it on a quarterly cadence, and integrate it into a broader AI-friendly content stack that also includes crawlable HTML, accurate structured data, comprehensive sitemaps, and credible authorship. Treat llms.txt as the new sitemap.xml — important to do well, dangerous to over-rotate on, and most valuable when it is one piece of a larger system.

Frequently Asked Questions

What is llms.txt and where does it live on a site?

llms.txt is a plain-text Markdown file proposed in 2024 by Jeremy Howard as a way for websites to expose curated, LLM-friendly summaries of their most important content. The file sits at the root of a domain, at the path /llms.txt, in the same location as robots.txt and sitemap.xml. Inside, the file uses Markdown headings and link lists to nominate the pages a publisher most wants AI systems to surface or cite. A companion file, /llms-full.txt, is sometimes published with concatenated cleaned content of those pages so an LLM with a long context window can ingest the full corpus in one fetch. The proposal is not a W3C standard and has no enforcement mechanism, but its simplicity made adoption fast among technical sites in 2025.

Do ChatGPT, Claude, Perplexity, and Google's AI features actually read llms.txt?

As of May 2026, the picture is uneven. Anthropic has publicly acknowledged that Claude's web fetcher considers llms.txt as one signal among many when summarizing a domain. Perplexity has discussed using llms.txt to improve citation quality. OpenAI and Google have been less explicit. Independent crawl analyses from sites like Common Crawl and from publisher logs show that the major AI labs' fetchers do request llms.txt when crawling a domain, but no lab has confirmed that the file is a primary input to training or to retrieval-augmented generation. The honest summary is that llms.txt is a low-cost hint, not a guaranteed ranking lever. Publishers who treat it as a guaranteed lever are setting themselves up for misallocated effort.

How is llms.txt different from robots.txt?

robots.txt is a permissioning file. It tells crawlers which paths they are allowed to fetch and which user agents are blocked. It is a directive that compliant crawlers respect. llms.txt is a curation file. It does not block or allow anything. It tells AI crawlers which pages on the site the publisher considers most important and well-suited for citation or summarization. The two files coexist. A site can use robots.txt to block GPTBot from paywalled paths, then use llms.txt to curate which open-access pages it wants surfaced. Treating llms.txt as if it were robots.txt, for example by using it to block crawlers, is a common configuration mistake.

What should publishers put in llms.txt and what should they leave out?

The strongest pattern is to publish a short Markdown file with a one-paragraph site overview, a Quick Links section pointing to the most-cited canonical pages, a Documentation or Knowledge Base section grouping evergreen content, and a Recent Updates section with the freshest authoritative pieces. Each entry should be a Markdown link followed by a short description. The file should be under a few hundred lines so an AI system can ingest it cheaply. Pages to leave out include thin marketing landing pages, dated promotional content, pages that duplicate other content, and any URL the publisher would not want quoted out of context. A messy llms.txt is worse than no file because it signals low editorial quality to the systems that do consume it.
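One way to keep a hand-curated file consistent is to hold the entries as data and render the Markdown from them. The sketch below follows the section layout described above. `render_llms_txt`, the page titles, and the example.com URLs are hypothetical, and the decision to reject entries without a description is a design choice of this sketch.

```python
def render_llms_txt(title: str, overview: str, sections: dict) -> str:
    """Render a file from heading -> [(link_title, url, description), ...]."""
    out = [f"# {title}", "", f"> {overview}", ""]
    for heading, entries in sections.items():
        out.append(f"## {heading}")
        out.append("")
        for link_title, url, desc in entries:
            if not desc:
                # Refuse to emit a link with no description.
                raise ValueError(f"missing description for {url}")
            out.append(f"- [{link_title}]({url}): {desc}")
        out.append("")
    return "\n".join(out)


# Hypothetical curated entries; a real file would list more pages per section.
text = render_llms_txt(
    "Example Company",
    "Example Company builds developer tools for AI applications.",
    {
        "Quick Links": [("Pricing", "https://example.com/pricing", "Plans and limits.")],
        "Documentation": [("API reference", "https://example.com/docs/api", "Complete API documentation.")],
        "Recent Updates": [("Changelog", "https://example.com/changelog", "Latest releases.")],
    },
)
```

The curation itself, meaning which pages make the list and how each is described, stays a human decision. The code only enforces the shape.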

Does llms.txt help with AI Overviews, AI Mode, or Perplexity citations?

Google's documentation for AI Overviews and AI Mode does not list llms.txt as a requirement, and Google has stated that the same SEO foundations that drive Search drive AI features. So llms.txt is unlikely to be a direct ranking input for Google's surfaces. For Perplexity and Claude, llms.txt appears to be one of many crawl-time signals, and publishers who maintain a clean file may see modest citation lift over time. The realistic expectation is that llms.txt becomes part of a broader AI-friendly content stack — clean HTML, accurate structured data, comprehensive sitemaps, and llms.txt — rather than a single lever that materially changes visibility on its own.

Will llms.txt eventually become an official standard?

There is no W3C or IETF working group adopting llms.txt as of mid-2026. The proposal remains a community specification maintained on its original spec page and a handful of GitHub repositories. Anthropic, Perplexity, and several smaller AI companies have publicly endorsed the format. Google and OpenAI have not committed to making it canonical. If the proposal does formalize, it is likely to happen the way sitemap.xml did: through enough industry adoption that the major search and AI vendors collectively agree to a stable schema. Publishers should treat the current spec as stable enough to implement, while expecting that conventions and best practices will continue to evolve.