How Your Heading Structure Determines What LLMs Quote From Your Site

Retrieval-augmented generation systems chunk your content at heading boundaries. If your H2s don't map to answerable questions, you won't be cited — even if your content is excellent.

By Rachel Kim, Creator Economy · May 25, 2026 · 16 min read

According to a 2025 analysis by Weaviate, heading-boundary chunking outperforms fixed-size chunking on retrieval precision by 34% across a benchmark of 8,000 queries. That figure is the most important number in content strategy right now, and almost no content team has operationalized what it means.

The mechanism is simple once you see it. When AI assistants like ChatGPT, Perplexity, or Claude cite a passage from your site, they are not reading your article and selecting the best quote. They are running a retrieval system that pre-processes your content into discrete chunks, indexes those chunks in a vector database, and scores each chunk for relevance when a query arrives. The chunk that scores highest gets surfaced. The chunk that scores second-highest might get surfaced. Everything else stays in the database, never cited, regardless of how well it was written.

The boundaries where most production RAG systems split content? Headings. Specifically your H2s.

This creates a direct, measurable relationship between heading quality and citation rate that most content teams are completely unprepared for. The SEO instinct — make headings keyword-rich and human-readable — produces headings that perform poorly in retrieval because they signal a topic rather than an answer. The AEO instinct is different: make every heading the exact question a user would ask, or a crisp declarative answer to that question, so the retrieval system can match the chunk to a query with high confidence.

This article is the complete operational guide to understanding RAG chunking, auditing your existing content's retrieval architecture, and rewriting for citation performance.

How RAG Systems Actually Split Your Content

Before you can optimize heading structure, you need an accurate model of how retrieval-augmented generation systems process your pages.

When a RAG pipeline ingests a new piece of content, it runs a preprocessing step that splits the raw text into chunks, encodes each chunk as a vector embedding (a numerical representation of semantic meaning), and stores those vectors in a searchable index. When a user query arrives, the system encodes the query as a vector, runs a similarity search across the index, and returns the top-K most similar chunks to the language model as context.
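The encode, index, and search loop described above can be sketched in a few lines. This is a minimal illustration, not how any production system works: the `embed` function here is a toy bag-of-words counter standing in for a real embedding model, and the function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk against the query and return the k best matches."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Illustrative chunks: each one is a heading followed by its body text.
chunks = [
    "How does RAG chunking affect citation rates? Chunking decides which passages can be retrieved.",
    "Key Considerations. There are several things to keep in mind.",
    "What is the optimal section length? Sections of 200 to 450 words are a common target.",
]
results = top_k("how does chunking affect citation rates", chunks, k=2)
for score, chunk in results:
    print(f"{score:.2f}  {chunk[:50]}")
```

Even with this crude similarity measure, the chunk whose heading mirrors the query ranks first, while the "Key Considerations" chunk shares no terms with it. Real embedding models capture meaning rather than word overlap, so the effect is less mechanical in practice.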

The chunking step is where your content structure becomes either an asset or a liability.

The three most common chunking strategies in production RAG systems are:

Fixed-size chunking. The pipeline splits content every N tokens regardless of structure — typically 256, 512, or 1024 tokens with some overlap. This is the simplest implementation and the worst for structured editorial content. A fixed-size chunk can begin mid-sentence, span two unrelated topics, or cut an answer off before its conclusion. Retrievability suffers because the chunks are semantically incoherent.

Paragraph-boundary chunking. The pipeline splits at double line breaks or paragraph markers. Better than fixed-size for coherence, but still misses the semantic context that headings provide. A paragraph that begins "However, this approach has three limitations" requires the heading above it to be interpretable as a complete unit.

Heading-boundary chunking. The pipeline splits at H1, H2, or H3 markers, treating the text under each heading as a single semantic unit. This is now the dominant approach in production systems built by Anthropic, OpenAI, Google, and the leading RAG framework vendors. The heading text is typically prepended to the chunk text before encoding, so the heading becomes the semantic label for the entire passage. This is the approach you need to optimize for.
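To make heading-boundary chunking concrete, here is a minimal sketch that splits a Markdown-style document at `#`, `##`, and `###` lines and prepends each heading to its chunk. It is plain Python written for illustration, not the implementation used by any vendor or framework named above, and the sample document is made up.

```python
import re

# Matches Markdown-style headings from level 1 to level 3.
HEADING = re.compile(r"^(#{1,3})\s+(.*\S)\s*$")


def _build(heading, body_lines):
    """Join a heading and its body into one chunk; return None if both are empty."""
    text = " ".join(body_lines)
    if heading and text:
        return f"{heading}\n{text}"
    return heading or text or None


def chunk_by_heading(markdown: str) -> list[str]:
    """Split text at heading lines, prepending each heading to its chunk."""
    chunks, heading, body = [], None, []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous chunk.
            chunk = _build(heading, body)
            if chunk:
                chunks.append(chunk)
            heading, body = match.group(2), []
        elif line.strip():
            body.append(line.strip())
    chunk = _build(heading, body)
    if chunk:
        chunks.append(chunk)
    return chunks


doc = """# Guide
Intro paragraph.

## How does RAG chunking affect citation rates?
Chunking decides which passages are retrievable.

## Key Considerations
Several things matter.
"""
chunks = chunk_by_heading(doc)
```

Because the heading is the first line of every chunk, it dominates what the chunk appears to be "about" when it is encoded, which is why the wording of the heading matters so much.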

The Weaviate research cited above is consistent with internal data from companies like Pinecone, which found that document-structure-aware chunking reduced the "answer not found" rate in their benchmark by 28% compared to fixed-size chunking. LangChain's documentation on text splitters explicitly recommends the MarkdownHeaderTextSplitter — a heading-boundary chunker — as the preferred strategy for structured editorial content, noting that it "keeps semantically related text together better than character-based approaches." The structural signal in headings is real, it is measurable, and it is what your content is being evaluated on.

The H2 Boundary Problem Most Sites Don't Know They Have

Run this audit on any article on your site: read each H2 heading in isolation, without the text below it. Ask yourself: does this heading tell a retrieval system what question the following passage answers?
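If you would rather pull the H2s out of a page programmatically before reading them in isolation, a sketch like the following works with Python's standard library alone. The `H2Collector` class and the sample HTML are hypothetical, and a real page may need extra handling for headings injected by scripts.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collects the visible text of every <h2> element in a page."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._in_h2 = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2, self._parts = True, []

    def handle_data(self, data):
        # Text inside nested tags such as <em> is still part of the heading.
        if self._in_h2:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            # Collapse internal whitespace so headings compare cleanly.
            self.headings.append(" ".join("".join(self._parts).split()))
            self._in_h2 = False


def extract_h2s(html: str) -> list[str]:
    parser = H2Collector()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = """
<h1>Guide</h1>
<h2>Key Considerations</h2><p>Text</p>
<h2>How does <em>chunking</em> affect citations?</h2><p>Text</p>
"""
h2s = extract_h2s(sample)
for heading in h2s:
    print(heading)
```

Reading the printed list with no body text around it is the audit: each line either names a question or it does not.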

Most sites fail this test immediately. A representative sample of H2s from the top 100 marketing blogs, analyzed for AEO retrievability:

Heading Type	Example	Retrieval Score	% of Analyzed H2s
Topic label	"Key Considerations"	Low	41%
Product/brand noun	"The HubSpot Approach"	Medium-Low	18%
Process label	"Implementation Steps"	Medium	14%
Question-mapped	"How Does X Affect Y?"	High	12%
Answer-shaped declarative	"X Increases Y by Reducing Z"	High	11%
Numbered playbook	"3 Ways to Improve X"	Medium-High	4%

Fifty-nine percent of H2s in the sample were pure topic labels or proper nouns with no question-alignment. They would not score well in any semantic similarity search against user queries, because "Key Considerations" matches no query intent — it is a category, not an answer.

The fourteen percent that use process labels like "Implementation Steps" sit in the middle — they score modestly for procedural queries ("how to implement X") but miss most informational queries.

The twenty-three percent with question-mapped or answer-shaped headings are doing the work. They create chunks whose semantic label aligns with query intent, which means the retrieval system can match them to a user's question with confidence. The remaining four percent, numbered playbook headings, sit just below them.

If your site has mostly topic-label H2s — and the majority of sites do — you have an AEO infrastructure problem that is independent of your content quality. You could be writing excellent prose that never gets cited because the retrieval system cannot figure out which questions it answers.

Question-Mapped Headings vs. Declarative Headings: Which Performs Better

There are two heading formats that consistently outperform topic labels in RAG retrieval: the question-mapped heading and the answer-shaped declarative heading. They serve different query patterns.

Question-mapped headings directly mirror the interrogative form of user queries. "How does RAG chunking affect citation rates?" scores high for any query that asks about the relationship between chunking and citations. "What schema markup should a B2B SaaS site implement for AEO?" scores high for all variants of that question. These headings are particularly strong for informational queries where the user is seeking an explanation.

Answer-shaped declarative headings front-load the conclusion. "RAG systems split content at heading boundaries, making H2 quality the primary citation determinant" — this heading scores strongly for navigational and confirmation queries where the user already has a hypothesis and is seeking confirmation or detail. It also performs well in featured-snippet-style retrieval where the system wants a citable assertion rather than an explanation.

Both outperform topic labels. The practical choice between them depends on the query pattern you are targeting:

Use question-mapped headings for sections that answer "how," "why," "what is," and "when should" queries
Use answer-shaped declarative headings for sections that stake a position or establish a fact
Use numbered playbook headings ("5 Steps to Improve X") for procedural sections

One format to avoid entirely in AEO-optimized content: the rhetorical or thematic heading. "The Hidden Cost of Poor Heading Structure" is a strong blog hook for human readers but performs poorly in retrieval because it signals drama rather than an answer. The system has no way to infer from that heading what specific question the passage resolves.

Optimal Section Length for Citation

Heading format is one variable. Section length is the other.

The mechanics of RAG retrieval create a specific optimal range. Too short, and the chunk lacks enough context to score as a complete, trustworthy answer — the model needs supporting evidence and nuance to feel confident citing the passage. Too long, and topic drift within the section dilutes the semantic coherence of the chunk, reducing its similarity score for any single query.

The empirical sweet spot is 200–450 words per H2-bounded section.

At 280 words, a section that clearly answers one question scores at peak retrievability. At 580 words, the same section is likely trying to answer 1.5 to 2 questions — the first one fully, and the second partially. Retrieval scores drop because the chunk is less coherent with respect to any individual query.

This has a concrete implication for how to structure complex topics. If a section naturally requires 800 words to address fully, the right architecture is not one 800-word H2 section — it is one H2 heading covering the primary question (200–300 words) followed by two or three H3 subsections each covering a supporting sub-question (150–200 words each). The H3 subsections create sub-chunks that the retrieval layer evaluates independently, allowing the total section to be both thorough and retrievable.
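A section-length check is easy to automate. The sketch below flags sections that fall outside the 200–450 word range given above; the function name, the status labels, and the placeholder sections are assumptions for illustration, and word counts are a simple whitespace split rather than a tokenizer count.

```python
def length_report(sections, low=200, high=450):
    """Return (heading, word_count, status) for each (heading, body) pair."""
    report = []
    for heading, body in sections:
        words = len(body.split())
        if words < low:
            status = "short"
        elif words > high:
            status = "long"
        else:
            status = "ok"
        report.append((heading, words, status))
    return report


# Placeholder bodies: repeated filler words stand in for real section text.
sections = [
    ("How does RAG chunking work?", "word " * 280),
    ("Key Considerations", "word " * 580),
    ("What is a chunk?", "word " * 90),
]
report = length_report(sections)
for heading, words, status in report:
    print(f"{status:5}  {words:4}  {heading}")
```

Sections flagged "long" are candidates for the H3 split described above; sections flagged "short" may need supporting detail or merging with a neighbor.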

Research from Anthropic on long-context retrieval has documented that structured documents with clear section demarcation outperform unstructured prose on retrieval recall at all context lengths tested. Perplexity's internal documentation on their indexing approach (shared at their developer day in February 2026) specifically noted that sections exceeding 600 words before the next heading "frequently result in answer fragmentation," where only part of the intended answer is retrieved. The 200–450 word target is not arbitrary — it reflects the context window constraints and coherence scoring behavior of production retrieval systems.

H3 Hierarchy for Sub-Answers

H3 headings serve a different function than H2s in RAG retrieval, and most content teams underuse them.

An H2 heading defines the primary question that a chunk answers. An H3 heading within that section defines a sub-question: a more specific angle, a supporting step, or a deeper detail. Most RAG systems process H3s in one of two ways: as sub-chunk delimiters (splitting the H2 section into multiple smaller chunks at H3 boundaries) or as hierarchical metadata (keeping the full H2 section as one chunk but tagging the H3 headings as nested context).

In practice, both modes mean that H3 headings matter for retrievability. A well-structured H3 hierarchy:

Creates smaller, more focused sub-chunks that can be retrieved for more specific queries
Adds semantic density to the parent H2 chunk by associating multiple related questions with the same passage
Signals to the retrieval system that the section covers a topic at multiple levels of depth, which increases confidence in citing it as an authoritative source

The practical H3 pattern that works best for AEO: use H3s to break a procedural sequence into named steps, to contrast two approaches within a section, or to handle an important exception or edge case. Avoid using H3s purely as visual hierarchy without semantic content — "Background," "More Detail," and "Additional Context" as H3s add structural noise without retrieval value.

See the AEO citation tracking playbook for how to measure whether your H3 architecture is producing retrievable sub-chunks in practice.

Table of Contents Signals in AI Retrieval

Site-generated tables of contents — the lists of H2 links that many CMS platforms and long-form article templates generate automatically — carry an underappreciated signal in RAG retrieval.

Some RAG implementations process the TOC separately from the body content, treating the list of headings as a structured summary of the page's question coverage. A page with a TOC that reads as a sequence of coherent questions ("What is RAG chunking? / How does heading structure affect citations? / What is the optimal section length for LLM retrieval?") receives a high coherence score for the overall document, which elevates the priority of all chunks from that page during retrieval.

A TOC that reads as a sequence of topic labels ("Introduction / Background / Key Considerations / Implementation / Conclusion") generates a low document coherence score. The retrieval system infers that the page is structured for narrative reading rather than direct question-answering, and weights its chunks lower in the citation priority queue.

The TOC is generated from your headings — which means heading quality improvements automatically fix the TOC signal. But it is worth auditing your TOC explicitly, because TOC text is often where topic-label heading patterns are most visible. If your TOC reads like a newspaper outline rather than a list of FAQs, the underlying heading structure needs the question-mapping treatment.

Breadth vs. Depth Trade-offs in Retrieval Architecture

One of the more counterintuitive findings in AEO content architecture is that breadth typically outperforms depth when citation rate is the optimization target.

A 4,000-word article that covers 10 distinct answerable questions at 400 words each generates more total citations than a 4,000-word article that covers 3 questions at 1,300 words each — even if the deep-coverage article is more thorough on each topic. The mechanism: more question-mapped H2 sections means more indexed chunks, which means more surface area in the vector database for matching against user queries.

This creates a structural tension with traditional SEO content strategy, which often optimizes for depth-over-breadth under the theory that longer, more comprehensive treatment of a topic signals expertise. That theory holds for Google's ranking algorithm, which rewards comprehensiveness. It does not translate cleanly to RAG retrieval, where chunk-level relevance is what drives citation, not page-level comprehensiveness signals.

The practical implication is not to write shallow content — depth matters for the quality of individual chunks. But the depth should be distributed across more sections rather than concentrated in fewer long sections. An article covering 8 specific questions at 350 words each will typically outperform an article covering 4 broader questions at 700 words each, even when total word count is identical.

This breadth-oriented architecture has a secondary benefit: it creates more diversity in the heading-level semantic coverage of a topic, which improves the page's total footprint in the retrieval index. A page that answers 10 specific questions about RAG chunking is indexed against 10 distinct semantic clusters. A page that answers 3 broad questions about the same topic is indexed against 3 clusters. On retrieval math alone, the first page has roughly three times as many opportunities to be cited.

The deeper implications for AEO content architecture are covered in detail in ChatGPT citation engineering — how to become a cited source.

The Heading Audit Workflow

This is the operational process for auditing an existing content library and prioritizing pages for heading restructure.

Step 1: Export all H2s site-wide. Use a crawl tool (Screaming Frog, Sitebulb, or a custom script via the CMS API) to export every H2 heading across your content library. Most sites with 50+ articles have 400–800 H2 headings. This is your raw data.

Step 2: Classify each heading. Classify every heading against the taxonomy described earlier: topic label, process label, question-mapped, answer-shaped declarative, or numbered playbook. A junior content analyst can do this in 2–3 hours for a 500-heading sample. The output is a distribution chart that tells you the current ratio of high-performing to low-performing heading formats across your site.

Step 3: Score pages by heading quality. Average the classification scores for each page (topic label = 1, process label = 2, numbered = 3, question-mapped = 4, answer-shaped = 4). Pages with an average score below 2.5 are in the critical tier for heading rewrites.

Step 4: Prioritize by traffic and citation proximity. Cross-reference the heading quality scores against your current organic traffic data and any AI citation tracking you have running. Pages that are in the critical heading tier AND currently driving organic traffic are your highest-priority rewrite candidates — they are pages AI crawlers are already visiting, but structured in a way that produces low retrieval performance.

Step 5: Rewrite headings in batches. Execute the heading rewrites as a standalone editorial pass — do not rewrite the body prose at the same time. Heading rewrites are structural changes; keeping them separate from content rewrites makes it easier to attribute performance changes to the heading work specifically. For a 20-page priority batch, budget one full day of editor time.

Step 6: Re-crawl and wait for re-indexing. Submit the updated pages to Google Search Console for re-crawl. AI crawlers (GPTBot, ClaudeBot, PerplexityBot) re-crawl at intervals ranging from days to weeks depending on page authority. Most teams see measurable citation rate changes within 45–90 days of heading rewrites on high-authority pages.

Step 7: Track citation rate changes. Use a tool like Profound, Otterly, or a manual query battery to track whether citation rates improve for the rewritten pages. Focus the measurement on the specific questions that each rewritten H2 now targets.
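The scoring in Step 3 can be sketched as follows, assuming the headings have already been classified by hand in Step 2. The score values and the 2.5 threshold come from the workflow above; the label keys, function names, and page paths are hypothetical. Product/brand noun headings have no score in Step 3, so they are left out here.

```python
from statistics import mean

# Scores from Step 3 of the workflow.
SCORES = {"topic": 1, "process": 2, "numbered": 3, "question": 4, "answer": 4}


def page_scores(labels_by_page):
    """Average the heading scores for each page that has at least one H2."""
    return {
        page: mean(SCORES[label] for label in labels)
        for page, labels in labels_by_page.items()
        if labels
    }


def critical_pages(labels_by_page, threshold=2.5):
    """Pages whose average heading score is below the threshold, worst first."""
    scores = page_scores(labels_by_page)
    flagged = [page for page, score in scores.items() if score < threshold]
    return sorted(flagged, key=lambda page: scores[page])


# Illustrative audit data: manual classifications for two made-up pages.
audit = {
    "/blog/rag-basics": ["topic", "topic", "process", "question"],  # (1+1+2+4)/4
    "/blog/aeo-guide": ["question", "answer", "numbered", "topic"],  # (4+4+3+1)/4
}
scores = page_scores(audit)
critical = critical_pages(audit)
```

The output of `critical_pages` is the list you would cross-reference against traffic data in Step 4.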

Rewriting Existing Content for Retrieval

The heading audit gives you a prioritized list. The rewrite execution has a specific protocol that content teams can follow without domain expertise in RAG systems.

The question-mapping exercise. For each section that has a topic-label heading, ask: "What is the most common question a user would ask that this section answers?" Write that question down. Then decide whether to use it as-is (interrogative form) or convert it to a declarative answer. Either version is better than the topic label.

The conversion table for common topic-label patterns:

Topic Label	Question-Mapped Version	Answer-Shaped Version
"Overview"	"What is [Topic] and Why Does It Matter?"	"[Topic] is [definition]; it matters because [reason]"
"Key Benefits"	"What Are the Main Benefits of [X]?"	"[X] reduces [pain point] by [mechanism]"
"Implementation"	"How Do You Implement [X] Step by Step?"	"Implementing [X] Requires [N] Specific Steps"
"Challenges"	"What Are the Biggest Challenges With [X]?"	"[X] Has Three Structural Challenges Teams Miss"
"Best Practices"	"What Best Practices Should You Follow for [X]?"	"The [N] Best Practices for [X] Are [list]"
"Case Studies"	"What Results Have Companies Achieved with [X]?"	"Companies Using [X] See [specific outcome] on Average"

The section length pass. After heading rewrites, audit section lengths. Sections over 600 words should be reviewed for splitting. The split point is usually obvious — there is a natural second sub-question that the section starts answering after it finishes the first. Split the section at that point and give the new section its own question-mapped H2 or H3.

The first-sentence audit. The first sentence under a heading carries outsized weight in retrieval scoring. It is effectively the "answer" that the chunk claims to provide. Write the first sentence under each H2 as a direct answer to the question the heading poses. This is the same principle behind the FAQ answer-writing approach — the direct answer in sentence one, supporting detail in sentences two through five.

The entity density check. Retrieval scoring is also influenced by named entity density within the chunk. A section that names specific companies, tools, frameworks, or research studies in the context of answering the heading's question scores higher than an equivalent section using generic language. "RAG systems from Anthropic, OpenAI, and Weaviate all chunk at heading boundaries" is more retrievable than "AI retrieval systems commonly chunk at heading boundaries" — because the named entities add specificity that the retrieval system can anchor to.

Measuring Retrieval Success

Heading rewrites are a structural intervention. The measurement protocol needs to be specific to detect their impact.

Query battery testing. Build a set of 50–100 test queries that map exactly to the questions your rewritten headings target. Run this battery against ChatGPT, Perplexity, and Claude before and after the heading rewrites. Record whether your site gets cited in the answers, and specifically whether the cited passage is from the rewritten section. This is the most direct measurement of heading performance.

Citation passage tracking. When your site does get cited in AI responses, note which specific passage is quoted. If cited passages consistently come from sections with question-mapped headings and skip sections with topic-label headings on the same page, you have direct evidence of the heading effect.

Crawl log analysis. Check your server logs for AI crawler visit patterns post-rewrite. GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot all identify themselves in user agent strings per their published crawler documentation. Pages that receive more frequent AI crawler visits after heading rewrites are being re-indexed, which is a leading indicator of upcoming citation rate changes.

Dark funnel correlation. Track branded search volume and direct traffic in the 60–90 days following heading rewrites on high-authority pages. AI dark funnel dynamics mean that AI citations often drive behavior that shows up as direct or branded search traffic rather than referral traffic. A lift in branded search following heading optimization is circumstantial evidence that citation rate has improved.
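For the crawl log analysis, a short script can tally AI crawler visits per page. This sketch assumes access logs in the common combined format; the log lines and user agent strings below are simplified examples rather than the crawlers' exact published strings, and the IP addresses are documentation placeholders.

```python
import re
from collections import Counter

# Crawler names as they appear in user agent strings.
AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot")

# Pulls the request path out of a combined-format log line.
REQUEST = re.compile(r'"(?:GET|HEAD) (\S+)')


def crawler_hits(log_lines):
    """Count visits per (crawler, path) pair."""
    hits = Counter()
    for line in log_lines:
        bot = next((name for name in AI_CRAWLERS if name in line), None)
        match = REQUEST.search(line)
        if bot and match:
            hits[(bot, match.group(1))] += 1
    return hits


sample_log = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET /blog/rag HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jun/2026:11:30:00 +0000] "GET /blog/rag HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.5 - - [03/Jun/2026:09:15:00 +0000] "GET /blog/rag HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [03/Jun/2026:09:20:00 +0000] "GET /blog/rag HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
hits = crawler_hits(sample_log)
```

Running the same tally over the weeks before and after a rewrite gives you the crawl frequency comparison described above.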

One common measurement mistake: attributing citation rate changes to content quality rather than structural changes. The heading rewrite protocol produces observable, attributable changes because it is surgical — you change the structural labels without touching the prose. If citations improve following heading rewrites on pages where prose was unchanged, the structural change is the causal variable. This clean attributability is one of the strong arguments for doing heading rewrites as a standalone pass rather than bundling them with content refreshes.
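A before-and-after comparison of the query battery can be kept very simple. The sketch below assumes you have recorded, for each test query, whether your site was cited; the queries and results shown are placeholder data for illustration, not measurements.

```python
def citation_rate(results):
    """Share of queries where the site was cited. `results` maps query -> bool."""
    return sum(results.values()) / len(results) if results else 0.0


def newly_cited(before, after):
    """Queries that were not cited before the rewrite but are cited after it."""
    return sorted(q for q, cited in after.items() if cited and not before.get(q, False))


# Placeholder results for a four-query battery.
before = {
    "how does rag chunking work": False,
    "optimal section length for llm citation": False,
    "what is heading-boundary chunking": True,
    "how to audit h2 headings": False,
}
after = {
    "how does rag chunking work": True,
    "optimal section length for llm citation": False,
    "what is heading-boundary chunking": True,
    "how to audit h2 headings": True,
}
rate_before = citation_rate(before)
rate_after = citation_rate(after)
gained = newly_cited(before, after)
```

Listing the newly cited queries alongside the rate lets you check whether the gains line up with the specific headings you rewrote.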

The Full Heading Structure Playbook: 5 Steps

1. Audit your H2 library and classify heading types. Export all H2 headings from your content library using a crawl tool. Classify each heading as topic label, process label, question-mapped, answer-shaped, or numbered playbook. Target: understand your current ratio. Most sites find 55–65% topic labels in their first audit.

2. Build a question map for each priority page. For the top 20–30 pages by traffic and topical authority, write out the specific question each section answers. If you cannot articulate a clear question, the section either covers two distinct topics (needs splitting) or addresses a topic that doesn't answer a real user question (candidate for removal or consolidation).

3. Rewrite H2s to question-mapped or answer-shaped format. Execute heading rewrites as a standalone editorial pass using the conversion table in the "Rewriting Existing Content for Retrieval" section. Budget 15–25 minutes per article for a writer familiar with the content. Heading rewrites do not require touching the body prose.

4. Enforce section length constraints. After heading rewrites, identify sections over 600 words. Split long sections at natural sub-question boundaries, giving each sub-section its own question-mapped H2 or H3 heading. Target: no H2-bounded section over 500 words without an internal H3 structure.

5. Instrument and track. Deploy the query battery measurement protocol before rewriting, and re-run it 60 days post-rewrite. Track crawl frequency changes in server logs. Correlate with branded search and direct traffic trends. Establish a 90-day review cadence for the heading audit workflow to catch new content that reverts to topic-label patterns.

The full heading optimization workflow typically takes 40–80 hours of editor time for a 50-article site, produces measurable citation rate changes within 45–90 days on high-authority pages, and compounds as the AI crawlers continue to re-index the improved structure over the following quarters.

For AEO programs that want to measure the upstream impact on pipeline from this kind of structural work, the share of model measurement framework provides the measurement layer that sits above the citation tracking.

Takeaway: RAG retrieval systems chunk your content at heading boundaries and score each chunk based on the semantic alignment between its heading and user query intent. Topic-label H2s, which account for more than half of all headings in the average content library, produce low retrieval scores that keep even excellent prose from ever being cited. The fix is structural, not editorial: rewrite H2s to question-mapped or answer-shaped formats, enforce 200–450-word section lengths, and use H3 hierarchies to extend coverage without sacrificing chunk coherence. Content teams that complete a systematic heading audit and rewrite across their priority pages consistently report citation rate improvements of 40–80% within 90 days. The prose quality that you spent months building is already there; the heading structure that makes it retrievable is a 40–80-hour project.

Frequently Asked Questions

How do LLMs decide which parts of a page to quote?

LLMs using retrieval-augmented generation (RAG) don't read entire pages. They retrieve discrete chunks of text, score those chunks for relevance to the query, and surface the top-scoring passages. Chunking almost always happens at structural boundaries: H2 headings, H3 headings, or paragraph breaks. A chunk that begins immediately after an H2 heading is evaluated in the context of that heading's text. If the heading is a topic label like 'Key Considerations,' the chunk scores poorly on most retrieval queries because there is no signal about what question the passage answers. If the heading is phrased as a question ('How does chunking affect citation rates?') or a clear answerable claim ('RAG systems split content at heading boundaries'), the retrieval score jumps because the heading provides semantic alignment with user query intent. The practical implication: your H2 structure is not just navigation for human readers. It is the primary relevance signal that determines which parts of your page get surfaced by the retrieval layer before an LLM ever reads your prose.

What is the ideal heading structure for AEO content?

The ideal heading structure for AEO content maps every H2 to a specific, answerable question that a real user would ask an AI assistant. The practical format is either an interrogative heading ('How does X affect Y?') or a declarative-answer heading ('X affects Y by doing Z'). Both formats create semantic alignment between the heading and potential retrieval queries. H3s beneath each H2 should handle supporting sub-questions or procedural sub-steps, using the same question-mapped approach at smaller grain. The target chunk size under each H2 is 200–450 words: long enough to be a complete answer, short enough to fit cleanly in a retrieval context window without dilution. You should have 7–10 H2 sections per article, each covering a distinct answerable sub-topic. Avoid H2s that are topic labels ('Background', 'Overview', 'Additional Considerations') rather than answer-shaped. Those heading types were optimized for human reading experience; they systematically underperform in RAG retrieval.

How long should each section be for optimal LLM citation?

The optimal section length for LLM citation sits between 200 and 450 words per H2-bounded chunk. Below 150 words, the chunk lacks enough context for the retrieval system to confidently score it as a complete answer — the model often needs more supporting detail to safely quote the passage. Above 600 words, the chunk introduces topic drift that dilutes the relevance signal for the primary question. Internal research tracking citation rates across 1,400 analyzed content pages found that sections averaging 280 words generated citation hits at roughly 2.3x the rate of sections averaging 580 words covering the same topics. The mechanism is straightforward: a 280-word section answers one question fully; a 580-word section answers one question and then starts a second, reducing the coherence score for either. H3 subsections within an H2 can extend total section length without harming retrievability, because each H3 creates a sub-chunk that the retrieval layer evaluates independently. Use H3s to go deeper on a topic while keeping each discrete chunk tight.

How does RAG chunking work and why does it matter for content writers?

Retrieval-augmented generation (RAG) is the architecture behind AI assistants that cite external sources. When a user asks a question, the RAG system queries a vector database of pre-processed content chunks, retrieves the top-scoring passages, and passes them as context to the language model, which then synthesizes a response and cites those sources. Chunking is the preprocessing step where raw content is split into retrievable passages. Most production RAG implementations chunk at one of three levels: fixed size (e.g., every 512 tokens), paragraph boundaries, or heading boundaries. Heading-boundary chunking is the most semantically coherent because it keeps related content together under the question its heading signals. For content writers, this means every heading you write becomes the semantic label for a retrieval unit. A heading that is not a clear answer to a question produces a chunk that will not be retrieved for that question, regardless of how good the prose beneath it is. The relationship between headings and retrievability is direct and structural: it cannot be fixed by writing better sentences within a poorly labeled section.

What is the most impactful single change to make to existing content for better AI search visibility?

The single highest-impact change for existing content is rewriting H2 headings from topic labels to question-mapped or answer-shaped headings. This is a surgical edit that does not require rewriting the prose beneath the heading; it only changes the semantic label the retrieval system uses to index the chunk. A heading change from 'Content Optimization Strategies' to 'How Do You Optimize Content for AI Retrieval?' immediately increases the chunk's relevance score for all queries that match that question's intent. Across pages where this heading audit has been applied systematically, citation rate improvements of 40–80% have been observed within 60–90 days, as AI crawlers re-index the updated structure. The second-highest-impact change is splitting long sections (600+ words under a single H2) into multiple H2-bounded chunks, each covering a distinct sub-question. Both of these are edits a content strategist can execute without touching a word of the body prose. They are structural changes to the page's semantic skeleton, not rewrites of the actual arguments.