Pillar Pages Are Back: Topical Authority for AEO in 2026

Citation share is a measurable metric, but only if you instrument it. A working prompt testing harness that hits ChatGPT, Claude, Perplexity, Gemini, and Grok daily costs $300 to $2,000 a month and answers the questions every CMO is now asking about AI search.

By Nadia Volkov, Enterprise Security · May 25, 2026 · 15 min read

In early 2026, OpenAI told TechCrunch that more than 800 million weekly active users were running prompts through ChatGPT, with a meaningful share asking product-recommendation and category-defining questions. Anthropic disclosed a similar arc for Claude, and Perplexity has been consistently reporting query growth in the high single-digit percentages month-over-month. Across the five major assistants — ChatGPT, Claude, Perplexity, Gemini, and Grok — the volume of category-shaped queries that surface brand citations has eclipsed Google's commercial-intent SERPs for entire B2B verticals.

This shift broke the SEO measurement stack. Rank trackers, traffic analytics, and keyword tools all assume a stable ten-link SERP that anyone with a scraper can audit. AI assistants do not present any such surface. Each response is generated per session, sometimes per user, and the citations that appear inside the answer are the entire game. Without measurement, every AEO budget request is a guess.

The fix is a prompt testing harness — a scheduled job that runs a fixed prompt suite against each assistant on a regular cadence, parses the responses for citations and brand mentions, and stores the results in a time-series format you can chart. The infrastructure is more achievable than most marketing teams realize. A working harness with five engines, a 100-prompt suite, daily cadence, and a usable dashboard ships in two to four engineering-weeks and runs for $300 to $2,000 a month depending on prompt count and engine mix. This piece is the operator's guide to building one — what to instrument, what the costs actually are, what the rate limits look like in practice, and where the build-vs-buy line sits in 2026.

Why Prompt Testing Is the New Rank Tracking

Rank tracking became a $400 million annual market because it solved a measurement problem that mattered. The harness category is on the same trajectory in 2026, and the underlying logic is similar. The companies that win an emerging channel are the ones who can measure it before everyone else can.

A prompt testing harness produces three categories of data that nothing else produces.

Share of citation by engine. For each prompt in your suite, what percentage of runs on each assistant cite your brand? This is the AEO analog of share of voice and the single most useful headline metric for executive reporting. A brand whose share of citation is moving up on ChatGPT but flat on Perplexity has a different problem set than a brand whose share is uniform across engines. The harness exposes that difference.

Competitor citation overlap. When your brand is cited, who else is cited in the same answer? AI assistants do not produce ten blue links — they produce a curated set of three to five names, and the names appearing alongside yours form your real competitive set in the AI search era. A B2B SaaS vendor that thought it competed with three named incumbents often discovers it is being cited alongside a different three based on prompt phrasing. That intelligence does not exist outside the harness.

Feature-claim accuracy. When AI assistants describe your product, are the claims they make accurate? This is the highest-stakes citation question and the one most marketing teams cannot answer. A harness that runs feature-specific prompts — does Brand X support feature Y, what is the price of Brand X — and audits the response against ground truth surfaces hallucination risk before it generates support load.

These three metrics rolled up into a weekly dashboard are the foundation of every serious 2026 AEO measurement stack. The marketing team that ships this measurement layer first is the one that gets the AEO budget the next quarter.
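As a rough illustration of the first two metrics, the sketch below computes share of citation per engine and counts co-cited brands from a handful of run records. The `Run` record shape and the brand names are hypothetical, chosen only to make the arithmetic visible; a real harness would read these rows from its storage layer.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One prompt execution on one engine (hypothetical record shape)."""
    engine: str
    prompt_id: str
    brands_cited: frozenset


def share_of_citation(runs, brand):
    """Fraction of runs on each engine whose answer cites `brand`."""
    totals, hits = Counter(), Counter()
    for run in runs:
        totals[run.engine] += 1
        if brand in run.brands_cited:
            hits[run.engine] += 1
    return {engine: hits[engine] / totals[engine] for engine in totals}


def co_citations(runs, brand):
    """Count how often each other brand shares an answer with `brand`."""
    overlap = Counter()
    for run in runs:
        if brand in run.brands_cited:
            overlap.update(run.brands_cited - {brand})
    return overlap


# Made-up sample data: two prompts on two engines.
runs = [
    Run("chatgpt", "p1", frozenset({"Acme", "Globex"})),
    Run("chatgpt", "p2", frozenset({"Globex"})),
    Run("perplexity", "p1", frozenset({"Acme", "Initech"})),
    Run("perplexity", "p2", frozenset({"Acme", "Globex"})),
]

print(share_of_citation(runs, "Acme"))
print(co_citations(runs, "Acme").most_common())
```

In this toy data the brand is cited in half of the ChatGPT runs and all of the Perplexity runs, which is the kind of per-engine gap the dashboard is meant to surface.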

For a broader view on the metrics layer that sits above the harness, the CMO AEO dashboard board deck guide covers the executive reporting framework.

The Reference Architecture

A working prompt testing harness in 2026 has six components. The architecture is intentionally boring — every choice should optimize for reliability, debuggability, and clean cost accounting.

1. Prompt store. The canonical list of prompts you run against each engine, organized by category, intent, and priority. YAML or JSON in a Git repo is the right starting point. Spreadsheets work for small suites but break down past a hundred prompts.

2. Engine adapters. Thin clients for each assistant's API — OpenAI, Anthropic, Perplexity, Google Gemini, and xAI Grok. Each adapter handles authentication, request formatting, response parsing, and the assistant-specific quirks around citation surfacing.

3. Scheduler. A job runner that executes the suite on a fixed cadence — daily for the priority set, weekly for the long tail. Cron, GitHub Actions, Render Cron Jobs, or Temporal all work. The constraint is that the schedule must be deterministic and the run history must be auditable.

4. Citation parser. A layer that takes the raw response from each engine and extracts the structured citation set — brands mentioned, URLs cited, position within the response, and any quoted text. This is the component that drifts most as the engines change their output formats, so it should be designed to be easy to update.

5. Storage. A time-series store that holds the raw responses, the parsed citation set, and the metadata for each run. Postgres works at the scale most teams need. ClickHouse or BigQuery if you are tracking more than 10,000 prompts.

6. Dashboard. A query layer and visualization surface that translates the raw data into the share-of-citation, overlap, and accuracy metrics you actually report on. Metabase, Hex, or a simple Next.js dashboard with Recharts all do the job.

The cleanest mental model: the harness is a tiny data pipeline. Treat it as one. Version control the prompt store, deploy the runners through CI, monitor the jobs with the same observability you use for production services. Marketing teams that try to run this stack on shared spreadsheets and ad hoc Python scripts spend more on maintenance than they would have spent building it right.
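One way to picture the six components as a small pipeline is sketched below. Everything here is an assumption made for illustration: the `Prompt` and `RunRecord` shapes, the `EngineAdapter` protocol, and the `FakeAdapter` that stands in for a real API client so the example runs offline. The URL parser is deliberately naive.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Protocol


@dataclass
class Prompt:
    id: str
    text: str
    priority: str  # "daily" or "weekly"


@dataclass
class RunRecord:
    prompt_id: str
    engine: str
    ran_at: datetime
    raw_response: str
    cited_urls: list


class EngineAdapter(Protocol):
    """Interface each engine adapter is assumed to satisfy."""
    name: str

    def ask(self, prompt_text: str) -> str: ...


class FakeAdapter:
    """Offline stand-in for a real API client; returns a canned answer."""

    def __init__(self, name, canned):
        self.name = name
        self._canned = canned

    def ask(self, prompt_text):
        return self._canned


def parse_urls(raw):
    """Naive citation parser: keep whitespace-separated tokens that look like URLs."""
    return [tok.strip(".,)") for tok in raw.split() if tok.startswith("http")]


def run_suite(prompts, adapters, cadence):
    """Run every prompt of the given cadence against every adapter."""
    records = []
    for prompt in prompts:
        if prompt.priority != cadence:
            continue
        for adapter in adapters:
            raw = adapter.ask(prompt.text)
            records.append(
                RunRecord(prompt.id, adapter.name, datetime.now(timezone.utc),
                          raw, parse_urls(raw))
            )
    return records


prompts = [Prompt("p1", "What is the best CRM?", "daily"),
           Prompt("p2", "Alternatives to HubSpot", "weekly")]
adapters = [FakeAdapter("engine-a", "See https://example.com/a."),
            FakeAdapter("engine-b", "No sources given")]
records = run_suite(prompts, adapters, cadence="daily")
print(len(records), records[0].cited_urls)
```

Keeping the raw response on every record is what lets the parser be rewritten later without losing history.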

Engine Coverage and What Each One Returns

The five assistants worth instrumenting in 2026 are ChatGPT, Claude, Perplexity, Gemini, and Grok. The engines differ enough in how they handle citations, browsing, and rate limits that each one deserves its own adapter.

| Engine | API Endpoint | Citation Format | Pricing (per 1M tokens) | Rate Limit (entry tier) |
|---|---|---|---|---|
| OpenAI ChatGPT | chat/completions, responses | Inline URLs in text, optional web_search tool returns structured citations | GPT-5: ~$2.50 input / $10 output | 500 RPM, 200K TPM (Tier 1) |
| Anthropic Claude | messages | URLs in text body, web_search tool returns citations array | Claude 4.5 Sonnet: ~$3 input / $15 output | 50 RPM, 40K TPM (Tier 1) |
| Perplexity | chat/completions | Structured citations array in every response | Sonar Pro: ~$3 input / $15 output + $5/1K requests | 50 RPM (basic), 2000 RPM (Pro) |
| Google Gemini | generateContent | Grounding metadata with web sources when Search tool enabled | Gemini 2.5 Pro: ~$1.25 input / $5 output | 360 RPM (paid tier) |
| xAI Grok | chat/completions | Inline URLs, structured citations in Live Search mode | Grok 4: ~$3 input / $15 output | 60 RPM (default) |

OpenAI ChatGPT. The Chat Completions and Responses APIs both support tool use, and the OpenAI web_search tool returns structured citation objects with URL, title, and snippet for any query that triggers a browse. For AEO harness purposes, enabling web_search is non-negotiable: without it, the model answers from its training data only, which is not representative of what a real ChatGPT user sees in the product. Rate limits scale with usage tier; production AEO harnesses typically land in Tier 3 or Tier 4, at 5,000 RPM or more and several million TPM, which is sufficient for a daily 500-prompt suite.

Anthropic Claude. The Claude Messages API supports a similar web_search tool that returns structured citations. Claude tends to cite more conservatively than ChatGPT and is more willing to explicitly decline to recommend specific products in answer sets, which makes Claude data useful as a noise floor — if Claude cites your brand, the citation signal is durable. Rate limits at Tier 1 are tight at 50 RPM and 40K TPM, which means a 100-prompt suite hits the limit unless you spread the runs across the hour. Tier 3 and 4 are where most production AEO harnesses operate.

Perplexity. The Perplexity Sonar API is the citation-friendliest of the major engines. Every response includes a structured citations array, the model is designed around web-grounded answers, and the API is purpose-built for the use case. Pricing is metered by tokens plus a per-request fee, which makes Perplexity the most expensive engine per prompt but also the highest-signal engine for citation analysis. Rate limits on the basic tier are 50 RPM, which is restrictive. Pro tier and enterprise raise the ceiling substantially.

Google Gemini. The Gemini API supports a Google Search grounding tool that returns grounding metadata when enabled. The citation format is different from OpenAI and Anthropic — Gemini returns a list of supporting web sources tied to specific segments of the response, which requires a different parser. Pricing is the most aggressive of the five engines, which makes Gemini the cheapest to run at scale.

xAI Grok. The Grok API exposes a Live Search mode that returns structured citations. Coverage and quality are improving rapidly but vary by topic. For B2B and SaaS categories, Grok citation share is meaningful enough in 2026 that excluding it from the suite means missing a real channel.

The harness should run all five engines for any prompt in the priority tier. The cost difference between running four engines and five is marginal, and the comparative signal across engines is one of the most useful outputs.
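Because each engine surfaces citations differently, the adapters usually end by mapping the engine-specific payload into one shared row format. The sketch below shows that idea for two shapes: a flat citations list and a nested grounding structure. Both payload shapes are simplified assumptions for illustration, not the providers' documented schemas, so check the current API references before relying on any field name.

```python
from urllib.parse import urlparse


def from_citation_list(payload):
    """Assumed shape: a flat list of URLs under a 'citations' key."""
    return list(payload.get("citations", []))


def from_grounding_chunks(payload):
    """Assumed shape: sources nested as grounding chunks with a 'web' entry."""
    chunks = payload.get("groundingMetadata", {}).get("groundingChunks", [])
    return [chunk["web"]["uri"] for chunk in chunks if "web" in chunk]


# Which extractor applies to which engine is itself an assumption here.
EXTRACTORS = {
    "perplexity": from_citation_list,
    "gemini": from_grounding_chunks,
}


def normalize(engine, payload):
    """Convert an engine-specific payload into uniform citation rows."""
    rows = []
    for position, url in enumerate(EXTRACTORS[engine](payload), start=1):
        host = urlparse(url).netloc.lower()
        rows.append({
            "engine": engine,
            "url": url,
            "domain": host.removeprefix("www."),
            "position": position,
        })
    return rows


sample = {"citations": ["https://www.acme.example/pricing",
                        "https://docs.globex.example/"]}
for row in normalize("perplexity", sample):
    print(row["position"], row["domain"])
```

Adding a new engine then means writing one extractor and registering it, while everything downstream keeps reading the same row format.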

The Build-vs-Buy Decision in Practice

The managed AEO tooling market in 2026 is dominated by three vendors — Profound, Otterly, and Peec — alongside enterprise SEO suites like Ahrefs and Semrush that have added citation tracking modules. The detailed comparison sits in the Profound vs Otterly vs Peec vs Ahrefs shootout; here the question is narrower: when does it make sense to build the harness yourself, and when does it make sense to buy?

Buy when:

Your measurement needs are standard — share of citation, competitor overlap, basic accuracy auditing — and you do not need to integrate with internal systems.
You want a working dashboard within 48 hours of a purchase order and you do not have engineering capacity to spare.
Your prompt taxonomy fits the vendor's prompt template — typically a few thousand canned prompts across major categories, plus a custom prompt slot.
The marketing team needs a self-serve UI and does not want to depend on engineering for every report.

Build when:

You need to integrate the citation data into your existing data warehouse, customer data platform, or attribution model.
Your prompt taxonomy is non-standard — internal-only product names, niche vertical categories, or competitive intelligence prompts that you do not want a third-party vendor to see.
You want the harness to feed real-time alerts into Slack, PagerDuty, or your incident management system when a high-priority prompt loses your brand citation.
You are running at a scale where the per-prompt unit economics of a DIY harness materially beat the managed tool's seat or volume pricing.

Hybrid approach. The pattern we see most often is a managed tool for the executive dashboard and a DIY harness for engineering-grade analysis. Profound or Otterly handles the daily share-of-citation chart that goes in the CMO's deck. The DIY harness handles the competitive intelligence prompts, the feature-claim auditing, and the integration with the rest of the data stack. This split lets the marketing team get a clean UI without giving up the deeper analytic surface.

The honest cost comparison: a managed tool at the $1,500-per-month entry point gets you a working dashboard, a curated prompt library, and a vendor-maintained citation parser. A DIY harness at the equivalent monthly cost gets you raw API spend plus infrastructure plus the engineering time to build and maintain it. The DIY harness is cheaper if you have spare engineering capacity. The managed tool is cheaper if you do not.

Reference Implementation: A Promptfoo-Based Harness

The fastest way to ship a working harness in 2026 is to use Promptfoo as the execution layer and bolt a custom citation parser and storage layer on top. Promptfoo handles the parallel execution, rate-limit backoff, response caching, and assertion model out of the box. The open-source repo is at github.com/promptfoo/promptfoo.

A minimal Promptfoo config for a five-engine AEO suite looks like this.

```yaml
description: AEO citation tracking suite

providers:
  - id: openai:gpt-5
    config:
      tools:
        - type: web_search
      max_tokens: 2000
  - id: anthropic:claude-4-5-sonnet
    config:
      tools:
        - type: web_search_20250305
          name: web_search
          max_uses: 5
  - id: https://api.perplexity.ai/chat/completions
    config:
      headers:
        Authorization: 'Bearer ${PERPLEXITY_API_KEY}'
      body:
        model: 'sonar-pro'
        return_citations: true
        # The HTTP provider substitutes the rendered prompt for {{prompt}}
        messages:
          - role: user
            content: '{{prompt}}'
  - id: google:gemini-2.5-pro
    config:
      tools:
        - googleSearch: {}
  - id: xai:grok-4
    config:
      search_parameters:
        mode: 'on'

prompts:
  - 'What is the best {{category}} for {{persona}}?'
  - 'Compare {{brand}} and {{competitor}} for {{use_case}}.'
  - 'What companies offer {{product_category}} in 2026?'

# This test only fills the variables used by the first prompt;
# add vars (or separate tests) for the other two templates.
tests:
  - vars:
      category: 'project management tool'
      persona: 'engineering teams'
    assert:
      - type: contains
        value: 'Linear'
      - type: javascript
        value: |
          // Count distinct URLs in the answer as a rough proxy for citations
          const text = typeof output === 'string' ? output : JSON.stringify(output);
          const urls = text.match(/https?:\/\/[^\s)"']+/g) || [];
          return new Set(urls).size >= 3;
```

The execution model is straightforward. Run `promptfoo eval` on a cron, parse the JSON output, push citations to Postgres, and chart the results. A fully functional harness — config, parser, scheduler, storage — fits in roughly 800 lines of Python and TypeScript and ships in a sprint.
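The "parse the JSON output, push to a table" step might look roughly like the sketch below. Two things are assumptions: the export layout (a trimmed, hypothetical sample is embedded; real promptfoo output carries more fields and can differ by version, so inspect your own file first), and the use of the standard-library `sqlite3` module as a stand-in for Postgres so the example runs without a server.

```python
import json
import sqlite3

# Hypothetical, trimmed-down export used only for this sketch.
SAMPLE_EXPORT = json.dumps({
    "results": {
        "results": [
            {"provider": {"id": "engine-a"},
             "prompt": {"raw": "What is the best CRM?"},
             "response": {"output": "Try Acme (https://acme.example)."}},
            {"provider": {"id": "engine-b"},
             "prompt": {"raw": "What is the best CRM?"},
             "response": {"output": "Globex is popular."}},
        ]
    }
})


def load_rows(export_json, ran_on):
    """Flatten the assumed export layout into (date, engine, prompt, raw) tuples."""
    data = json.loads(export_json)
    return [
        (ran_on, item["provider"]["id"], item["prompt"]["raw"],
         item["response"]["output"])
        for item in data["results"]["results"]
    ]


def store(conn, rows):
    """Append rows to a runs table, creating it on first use."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs "
        "(ran_on TEXT, engine TEXT, prompt TEXT, raw_response TEXT)"
    )
    conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a Postgres connection
store(conn, load_rows(SAMPLE_EXPORT, "2026-05-25"))
count = conn.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
print(count)
```

Storing the raw response alongside the run metadata keeps the table useful even before a citation parser exists.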

For the storage and dashboard layer, the lightest-weight viable stack is Render or Railway for the scheduler, Supabase or Neon for Postgres, and Metabase for the dashboard. Combined infrastructure cost runs $50 to $150 per month for a 500-prompt suite. The dominant cost is LLM API spend.

For teams that want a deeper view on how this data flows into a citation share dashboard, the multi-engine share of citation dashboard build guide covers the visualization layer in detail.

The 90-Day Implementation Playbook

For a marketing or growth team standing up an AEO harness from zero in 2026, this is the sequence that consistently works.

1. Define the prompt suite first. Before any code is written, write the prompt list. Start with 50 prompts across three categories — head-term category queries, comparison queries, and feature-claim queries. Aim for prompts that real prospects ask. Phrasings like best CRM for B2B SaaS, alternatives to HubSpot, and does Pipedrive integrate with Slack are the right shape. Pure SEO keywords like CRM software are not.

2. Build the cheapest possible MVP. Wire up one engine — Perplexity is the lowest-friction starting point because the citation format is structured — and a single cron that runs the prompt suite daily and dumps the responses to a Postgres table. No dashboard yet. The goal is to confirm the data flow before you invest in visualization.

3. Add the other four engines one at a time. Add OpenAI second, Anthropic third, Gemini fourth, Grok fifth. Each engine takes one to two engineering days because of authentication, response parsing, and rate-limit handling differences. Resist the urge to add all five at once — the debugging compounds when one breaks the others.

4. Build the citation parser. Write a normalized schema for citations — brand name, URL, position, snippet, engine, timestamp — and a parser that converts each engine's raw response into the schema. Plan to iterate. The first version of the parser will miss 10% to 20% of citations because brands appear in many forms (Salesforce, salesforce.com, SF, Salesforce.com Inc) and the dedupe logic takes work.

5. Ship a dashboard the CMO can read. Three charts to start: share of citation by engine over time, top competitors cited alongside your brand, and citation accuracy on feature-claim prompts. Metabase or Hex handles this in a day. Avoid the temptation to build a custom Next.js dashboard until the data flow is stable.

6. Add alerting. Once the dashboard is reliable, add Slack alerts for material citation changes — a competitor breaking into a head-term answer, a sudden drop in citation rate on a priority prompt, a new domain appearing in the citation set. Alerting is what makes the harness operationally useful versus a weekly report.

7. Expand the prompt suite quarterly. Add 50 to 100 prompts per quarter as new categories, products, and competitive dynamics emerge. The harness compounds in value as the prompt suite grows, because longitudinal data on a stable prompt set is more useful than spot checks on a constantly-changing list.

8. Audit the parser monthly. AI assistants change response formats more often than most teams expect. Run a manual audit of 20 random parsed responses every month to catch parser drift early.

Teams that follow this sequence have a production-grade harness running in 60 to 90 days. Teams that try to build the full system in one push typically take twice as long and end up with brittle infrastructure.
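The alerting step can start as a plain comparison of citation rates between two windows. The sketch below flags prompts whose rate fell by at least a chosen threshold; the data shape (prompt ID mapped to one boolean per run), the 0.3 threshold, and the message format are all illustrative assumptions, and posting the message to Slack is left out.

```python
def citation_rate(cited_flags):
    """Share of runs in which the brand was cited."""
    return sum(cited_flags) / len(cited_flags) if cited_flags else 0.0


def find_drops(previous, current, min_drop=0.3):
    """Return (prompt_id, before, after) for prompts whose rate fell by min_drop or more.

    `previous` and `current` map prompt IDs to lists of booleans,
    one per run in that window (True means the brand was cited).
    """
    alerts = []
    for prompt_id, prev_flags in previous.items():
        cur_flags = current.get(prompt_id)
        if not cur_flags:
            continue  # no runs this window, nothing to compare
        before, after = citation_rate(prev_flags), citation_rate(cur_flags)
        if before - after >= min_drop:
            alerts.append((prompt_id, before, after))
    return alerts


def format_alert(prompt_id, before, after):
    """Build a human-readable alert line."""
    return f"Citation rate on '{prompt_id}' fell from {before:.0%} to {after:.0%}"


last_week = {
    "best-crm": [True] * 7,
    "alt-hubspot": [True, False, True, True, False, True, True],
}
this_week = {
    "best-crm": [True, True, False, False, False, False, False],
    "alt-hubspot": [True, True, False, True, True, False, True],
}

for alert in find_drops(last_week, this_week):
    print(format_alert(*alert))
```

Comparing windows of several runs rather than single responses keeps ordinary run-to-run variation from paging anyone.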

Cost and Rate Limit Realities

Real numbers from production harnesses

The headline numbers from harnesses we have seen running in production at B2B SaaS companies in 2026.

Small harness. 50 prompts, daily cadence, five engines. Approximately 7,500 LLM calls per month. Monthly cost: $180 to $400 in API spend depending on engine mix, plus $50 to $100 in infrastructure. Total: $230 to $500 per month.

Medium harness. 200 prompts, daily cadence, five engines, plus weekly cadence on an additional 500 long-tail prompts. Approximately 40,000 calls monthly. Monthly cost: $700 to $1,400 in API spend, plus $100 to $200 in infrastructure. Total: $800 to $1,600.

Large harness. 500 prompts daily, 2,000 prompts weekly, five engines, alerting, custom dashboards, and integration with the data warehouse. Approximately 120,000 calls per month. Monthly cost: $1,800 to $3,500 in API spend, plus $300 to $500 in infrastructure. Total: $2,100 to $4,000.

The cost driver in all three cases is Perplexity, which combines a per-token charge with a per-request fee and runs roughly 2x the cost of ChatGPT on equivalent prompts. The cheapest engine is Gemini, at roughly 40% of the cost of the others on equivalent prompts. Teams that want to reduce harness cost typically pull Perplexity down to the priority tier only and let Gemini absorb the long-tail volume.

Compared to the $1,500 to $5,000 per month that managed tools charge, a DIY harness at the medium-to-large tier is roughly cost-neutral once engineering time is counted, and meaningfully cheaper if you have engineering capacity that would otherwise sit idle. Buy the managed tool if your engineering team is over capacity. Build if you have a half-time engineer to dedicate.
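The call volumes behind these tiers are simple to reproduce. The sketch below assumes a 30-day month and 52/12 weekly runs per month; both are conventions chosen for the example, and the per-call figure simply divides the quoted API spend for the small tier by its call count.

```python
DAYS_PER_MONTH = 30
WEEKLY_RUNS_PER_MONTH = 52 / 12  # about 4.33 weekly runs in an average month


def monthly_calls(daily_prompts, weekly_prompts=0, engines=5):
    """Estimated API calls per month for a given suite shape."""
    daily = daily_prompts * engines * DAYS_PER_MONTH
    weekly = weekly_prompts * engines * WEEKLY_RUNS_PER_MONTH
    return round(daily + weekly)


def cost_per_call(monthly_api_spend, calls):
    """Average API spend per call in dollars."""
    return monthly_api_spend / calls


# (daily prompts, weekly prompts) for the three tiers described above
tiers = {"small": (50, 0), "medium": (200, 500), "large": (500, 2000)}
for name, (daily, weekly) in tiers.items():
    print(name, monthly_calls(daily, weekly))

small_calls = monthly_calls(50)
low = round(cost_per_call(180, small_calls), 3)
high = round(cost_per_call(400, small_calls), 3)
print(low, high)
```

Under these assumptions the small tier comes to 7,500 calls a month, so the quoted $180 to $400 in API spend implies roughly 2 to 5 cents per call.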

Rate limits in practice

The published rate limits and the rate limits you actually hit in production are different numbers. The patterns to plan for.

OpenAI. Tier 1 limits are 500 RPM, but they scale rapidly with usage and payment history. Tier 4 (which most production harnesses reach within a quarter) is 10,000 RPM and 30M TPM. The web_search tool adds latency — typically 5 to 15 seconds per call — which means parallel execution matters more than RPM ceiling for harness throughput.

Anthropic. Tier 1 limits are restrictive at 50 RPM, and the tier-up process takes longer than OpenAI's. Production harnesses run at Tier 3 or 4 with 4,000 RPM. The web_search tool is metered separately and adds $10 per 1,000 searches on top of token costs.

Perplexity. Basic tier is 50 RPM, which is restrictive for a daily 500-prompt suite. The Pro tier raises the ceiling substantially but adds a per-seat fee. For high-volume harnesses, the Sonar API enterprise tier is the only viable option.

Gemini. Paid tier is 360 RPM by default and scales with usage. The grounding tool adds latency similar to OpenAI's web_search but at lower marginal cost.

Grok. Default is 60 RPM with limited tier visibility. Production harnesses typically need to coordinate with xAI for elevated limits if running more than a few hundred prompts daily.

The practical implication is that the harness scheduler should distribute calls evenly across the hour, implement exponential backoff on 429 responses, and queue failed calls for retry rather than dropping them. Promptfoo handles most of this out of the box; custom harnesses need to implement it explicitly.
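For a custom harness, those three behaviors can be sketched in a few functions. This is a minimal illustration under stated assumptions: `RateLimited` is a hypothetical exception that an adapter would raise on an HTTP 429, the backoff uses the common "full jitter" variant, and the sleep function is injectable so the demo finishes instantly.

```python
import random
import time
from collections import deque


class RateLimited(Exception):
    """Hypothetical error an adapter raises when the API answers HTTP 429."""


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: random wait up to base * 2**attempt, capped."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def spaced_offsets(n_calls, window_seconds=3600):
    """Evenly spaced start offsets (in seconds) across the window."""
    step = window_seconds / n_calls
    return [i * step for i in range(n_calls)]


def run_with_retries(jobs, call, max_attempts=5, sleep=time.sleep):
    """Run jobs in order; re-queue rate-limited jobs instead of dropping them."""
    queue = deque((job, 0) for job in jobs)
    done, failed = {}, []
    while queue:
        job, attempt = queue.popleft()
        try:
            done[job] = call(job)
        except RateLimited:
            if attempt + 1 >= max_attempts:
                failed.append(job)  # give up, but keep a record of the failure
            else:
                sleep(backoff_delay(attempt))
                queue.append((job, attempt + 1))
    return done, failed


# Demo: job "b" is rate-limited twice before succeeding.
seen = {}

def demo_call(job):
    seen[job] = seen.get(job, 0) + 1
    if job == "b" and seen[job] < 3:
        raise RateLimited()
    return f"ok:{job}"

results, failures = run_with_retries(["a", "b", "c"], demo_call, sleep=lambda s: None)
print(results, failures)
print(spaced_offsets(4))
```

The sleep here blocks the whole queue, which is acceptable for a single-threaded sketch; a concurrent runner would schedule the retry for later and keep working through other jobs.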

What Goes in the Prompt Suite

The shape of the prompt suite drives the value of the harness. A 500-prompt suite of bad prompts produces less useful data than a 50-prompt suite of good ones. The categories worth instrumenting.

Head-term category prompts. What is the best CRM, what is the top observability platform, who are the leading vendors for X. These prompts are the highest-stakes citation surface — being cited in the head-term answer is the equivalent of ranking #1 for the head keyword in 2015. The suite should cover every major head term in your category.

Comparison prompts. Compare X and Y, X vs Y for use case Z, alternatives to X. These prompts capture switching and evaluation intent and tend to surface a different competitive set than head-term prompts. Cover the top 10 to 15 competitors with comparison prompts.

Feature-claim prompts. Does X support Y, what is the price of X, how does X integrate with Z. These prompts surface accuracy risk and are the leading indicator of support load. Cover the top 30 to 50 features and pricing facts about your product.

Persona-shaped prompts. Best X for engineering teams, top Y for early-stage startups, recommended Z for enterprise IT. These prompts capture segment-specific positioning and are the cleanest way to measure whether your AEO investment in vertical content is moving the needle.

Long-tail prompts. Specific use-case queries that fall outside the head terms. These prompts are individually low-volume but collectively important because they reveal where your brand is being cited for use cases you did not target deliberately.

A balanced 200-prompt suite usually allocates 30% to head terms, 25% to comparison, 25% to feature-claim, 10% to persona, and 10% to long-tail. The exact split depends on the category and the stage of the AEO program.
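As a quick worked example, the split above can be turned into per-category prompt counts for any suite size. The category keys are labels invented for this sketch, and handing rounding leftovers to the largest categories is one arbitrary choice among several reasonable ones.

```python
# Percentages from the balanced split described above
SPLIT = {
    "head_term": 30,
    "comparison": 25,
    "feature_claim": 25,
    "persona": 10,
    "long_tail": 10,
}


def allocate(total, split=SPLIT):
    """Turn percentage shares into whole prompt counts that sum to `total`."""
    if sum(split.values()) != 100:
        raise ValueError("split must sum to 100")
    counts = {name: total * pct // 100 for name, pct in split.items()}
    # Integer division can leave a few prompts unassigned;
    # give them to the largest categories first.
    leftover = total - sum(counts.values())
    for name in sorted(split, key=split.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts


print(allocate(200))
print(allocate(50))
```

For a 200-prompt suite this yields 60 head-term, 50 comparison, 50 feature-claim, 20 persona, and 20 long-tail prompts.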

What Kills Harness Projects

Patterns that consistently break AEO harness implementations in 2026.

Treating the harness as a one-time project. The harness needs ongoing maintenance — parsers drift, prompts go stale, engines change their formats. Teams that ship the harness and walk away end up with a dashboard that quietly stops working within a quarter.

Building the dashboard before the data flow. Teams that invest in a custom dashboard before the underlying data is reliable end up rebuilding the dashboard when they discover the parser is wrong. Ship the data pipeline first, use a generic dashboard tool like Metabase, and invest in custom UI only after the data is solid.

Ignoring response caching. Many AEO prompts return roughly stable responses over short time windows. Running the same prompt every hour wastes API budget without producing additional signal. A 12-hour to 24-hour response cache on most prompts cuts cost meaningfully without losing freshness.

Underestimating brand-name normalization. The same brand appears in citations as Stripe, stripe.com, Stripe Inc, and Stripe, Inc. depending on the engine and the response. Without normalization, the citation count is wrong. Plan for a normalization layer from day one.

Not logging raw responses. Teams that only store parsed citations end up unable to re-analyze historical data when the parser improves. Storage is cheap; raw responses should be persisted indefinitely.

Running the harness on a developer's laptop. A harness that depends on someone manually running a script breaks the first time that person is on vacation. Run it on managed infrastructure from day one.
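The brand-name normalization problem mentioned above is mostly a matter of canonicalizing a mention before counting it. A minimal sketch follows, assuming a hand-maintained alias table (`ALIASES` is hypothetical) and a short list of corporate suffixes; real suites need a longer table and periodic review of unmatched mentions.

```python
import re

# Hypothetical alias table: normalized key -> canonical brand name
ALIASES = {
    "stripe": "Stripe",
    "stripe.com": "Stripe",
}

# Trailing corporate suffixes such as " Inc", ", Inc.", " LLC"
_SUFFIX = re.compile(r",?\s+(inc|llc|ltd|corp)\.?$", re.IGNORECASE)


def normalize_brand(mention):
    """Map a raw brand mention or URL to a canonical name when an alias is known."""
    key = mention.strip().lower()
    key = re.sub(r"^https?://(www\.)?", "", key).rstrip("/")
    key = _SUFFIX.sub("", key)
    # Unknown mentions pass through unchanged so they can be reviewed later.
    return ALIASES.get(key, mention.strip())


for raw in ["Stripe", "stripe.com", "Stripe Inc", "Stripe, Inc.",
            "https://www.stripe.com/", "Acme"]:
    print(repr(raw), "->", normalize_brand(raw))
```

Returning unknown mentions unchanged, rather than dropping them, makes gaps in the alias table visible in the data.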

The Vendor Landscape in 2026

Beyond Promptfoo, the open-source and commercial tools worth knowing.

Profound has emerged as the category-leading managed AEO tool, with a strong dashboard, an extensive prompt library, and enterprise pricing in the $2,000 to $10,000-per-month range. Their public materials at tryprofound.com document the category clearly.

Otterly is a strong mid-market option with self-serve pricing starting around $500 per month. Their blog at otterly.ai is one of the better public sources on AEO measurement methodology.

Peec AI focuses on European markets and multi-language tracking, with pricing competitive with Otterly. Useful for teams operating in multiple languages.

Ahrefs and Semrush have both added AI search modules to their existing SEO suites. The integration with traditional SEO data makes them attractive for teams already on those platforms.

LangSmith and Helicone are LLM observability tools that are not AEO-specific but provide useful infrastructure for monitoring API spend, response latency, and error rates on a DIY harness.

The recommendation for most teams in 2026 is to start with Promptfoo plus Metabase for the build path, or Otterly or Peec for the buy path, and only graduate to enterprise tooling once the AEO program has demonstrated clear value to the executive team. The space is moving fast enough that committing to a $50,000-annual contract before the program is proven is rarely the right call.

Takeaway: A prompt testing harness is the foundational measurement layer for AEO in 2026, and it is more buildable than most marketing teams assume. A working harness across five engines, with a 200-prompt suite running daily, ships in 60 to 90 days and runs for under $1,500 a month. The build-vs-buy decision is real but not binary — the most effective teams pair a managed tool for the executive dashboard with a DIY harness for the deeper analytic surface. The marketing team that ships this measurement layer first gets the next round of AEO budget because they can prove the channel works. The teams that wait spend another year guessing.

Frequently Asked Questions

What is a prompt testing harness for AEO and why do I need one?

A prompt testing harness is the AEO equivalent of a rank tracker — a scheduled job that runs a fixed list of prompts against each major AI assistant on a regular cadence and records the responses, citations, and brand mentions. You need one because the AI search surface is opaque by default. Unlike Google, where SERP scrapers have been a commodity for fifteen years, AI assistants do not expose ranking data, and the answers are generated dynamically per session. Without a harness, your team has no measurement layer for the channel that increasingly drives top-of-funnel discovery. With one, you can track share of citation over time, detect when a competitor breaks into a head-term answer, audit feature-claim accuracy, and report channel performance to a board that now expects AI search to be measured the same way paid search and organic search have been measured for a decade.

How much does it cost to run a prompt testing harness?

Costs range from roughly $300 per month for a small DIY harness to $2,000 a month or more for production-grade infrastructure, with managed vendor tools sitting between $500 and $5,000. A 100-prompt suite run daily across the five major assistants generates about 15,000 LLM API calls per month. At average token costs in 2026, that runs $250 to $600 in raw API spend. Add $20 to $80 for Perplexity API and a similar amount for Grok and Gemini, then $50 for a scheduling and storage layer like Render, Railway, or a small Postgres instance. A 500-prompt enterprise suite tripled in cadence runs closer to $2,000 per month including infrastructure, monitoring, and storage. Managed tools like Profound, Otterly, and Peec price by tracked prompts, brands, and engines, with starter plans around $499 monthly and enterprise tiers exceeding $5,000.

Should I build my own harness or buy Profound, Otterly, or Peec?

Build if you have an engineer with at least 20% capacity and your measurement needs are non-standard — custom prompt taxonomies, internal data sources, or integration with your existing data warehouse. Buy if you want a working dashboard in 48 hours and you do not need the data to flow into a custom pipeline. The honest tradeoff is that managed tools save you four to six weeks of engineering and give you a UI your marketing team can use without help, but they constrain the prompt taxonomy and the citation parser to whatever the vendor supports. DIY gives you full control and lower per-prompt cost at scale, but you are now operating a small data pipeline with all the maintenance that implies. The pattern we see most often in 2026 is companies starting with a managed tool to validate the measurement layer, then migrating to DIY once the use case is well-defined.

Which AI assistants should the harness cover and at what cadence?

Cover ChatGPT, Claude, Perplexity, Gemini, and Grok at minimum. The five assistants together represent more than 95% of AI search traffic in 2026, and the citation behavior between them differs enough that a measurement on any single engine misses important signal. Cadence should be daily for the top 20 to 50 highest-priority prompts and weekly for the long tail, because AI assistant answers shift more than most teams expect: a competitor mention can appear, disappear, and reappear within a week as the underlying retrieval-augmented generation pipeline updates. Running more often than daily is rarely useful because variation between consecutive calls swamps any real signal. Run the suite at a fixed time of day in a single time zone to keep the data comparable, and log the full raw response in addition to the parsed citation set so you can re-parse historical data when your extraction logic improves.

What does Promptfoo do and how does it fit into an AEO harness?

Promptfoo is an open-source testing framework originally built for prompt engineering and LLM evaluation, but its declarative test-suite model makes it a useful foundation for an AEO citation harness. You define prompts in YAML, configure providers for OpenAI, Anthropic, Perplexity, Google, and others, and run the suite from the command line or CI. Promptfoo handles parallel execution, rate-limit backoff, response caching, and assertion-based evaluation, which means you can write assertions like response must include brand name X or response must not cite competitor Y and have the harness flag failures automatically. For AEO use, Promptfoo handles the execution and assertion layer; you typically still need a separate parser for citation extraction and a storage layer for time-series analysis. It is free, well-documented at promptfoo.dev, and the most common starting point for engineering teams building AEO harnesses in-house in 2026.