How to Build a Multi-Engine AI Citation Dashboard From Scratch

Tracking ChatGPT, Perplexity, Claude, Gemini, and Copilot simultaneously requires a different architecture than any existing analytics tool provides. Here is the build guide.

By Emily Sato, Consumer Social · May 25, 2026 · 15 min read

In Q1 2026, Bain & Company's research on AI-influenced B2B purchasing found that 67% of enterprise software buying committees used at least one AI assistant to generate a vendor shortlist before issuing an RFP. The companies that appeared on those AI-generated shortlists won first-call meetings at 3x the rate of companies that did not. The companies that could measure their shortlist appearance rate — and optimize it — were a still-smaller group. That measurement gap is the problem this article solves.

Multi-engine AI citation tracking is the discipline of systematically querying ChatGPT, Perplexity, Claude, Gemini, and Microsoft Copilot to measure how often your brand appears in AI-generated answers across a defined set of queries — and then tracking that rate over time. As of May 2026, no off-the-shelf analytics tool does this comprehensively across all five major engines. The teams with the most sophisticated citation intelligence have built their own infrastructure. This is the build guide.

Why Multi-Engine Tracking Is Different From Single-Engine Spot Checks

Most AEO programs start with a simple test: go to ChatGPT, type in a few category queries, see whether the brand appears. That exercise is useful as a sanity check. It is not useful as a measurement system, for three structural reasons.

Citation behavior varies significantly across engines. In our testing across 40 B2B SaaS categories in Q1 2026, the overlap in cited brands between ChatGPT and Perplexity averaged 61% — meaning 39% of the brands cited in Perplexity answers for a given category did not appear in ChatGPT answers for the same query. The overlap between Claude and Gemini was even lower, at 54%. A brand that appears confidently in ChatGPT category responses may be effectively invisible in Perplexity, which increasingly serves buyers doing active vendor research. Single-engine tracking systematically misrepresents total citation exposure.

Model updates shift citation behavior unpredictably. When OpenAI released GPT-4o in May 2024, citation patterns in several B2B categories shifted by 15 to 25 percentage points within two weeks of the release, as the new model's different training data produced different brand associations. Teams tracking only one engine at weekly frequency often could not tell whether their citation share changed because of their own content investments or because of an upstream model update. Multi-engine tracking lets you triangulate: if your share drops on one engine but holds on others, the likely cause is a model update rather than a content problem. If share drops across all engines simultaneously, you have a real issue.

Different engines serve different buyer intents. The evidence from clickstream and survey data is consistent: Perplexity users skew toward active research (comparing specific products, building shortlists), while ChatGPT users skew toward broader informational queries. Claude users, in our survey data, report higher rates of deep-dive research queries on technical topics. A B2B brand that dominates ChatGPT citations but is absent from Perplexity is winning awareness and losing consideration. Without multi-engine tracking, this pattern is invisible.

The AEO citation tracking playbook covers the measurement philosophy in depth. This article focuses on the technical architecture of building the tracking system itself.

The Five Engines and Their Tracking Characteristics

Before designing the architecture, you need to understand the data access model for each engine. They are not equivalent.

Engine	API Available	Web Search in API	Response Includes Citations	Notes
ChatGPT (GPT-4o)	Yes (OpenAI API)	Optional (tool use)	Inline text, not structured	Browsing adds latency and cost
Perplexity	Yes (beta API)	Always on	Inline sources list	Most citation-rich responses
Claude (Sonnet/Opus)	Yes (Anthropic API)	No (as of May 2026)	Training data only	Best for entity-association testing
Gemini	Yes (Google AI Studio)	Optional	Variable by model	Gemini 1.5 Pro most useful
Microsoft Copilot	No public API	Always on	Structured source list	Requires web simulation

This table determines your data collection architecture. ChatGPT, Claude, and Gemini are straightforward API integrations. Perplexity is in beta but accessible with a waitlist key. Copilot requires browser automation via Playwright or Puppeteer, which introduces maintenance overhead and fragility that the API-based integrations do not have.

The practical implication for most teams: start with the four that have accessible APIs (ChatGPT, Perplexity, Claude, Gemini) and add Copilot simulation in a later phase. For B2C brands in high-intent categories — travel, financial products, consumer software — Copilot is important enough to prioritize earlier because Bing's integration with Copilot means it captures significant purchase-intent traffic.

Designing the Prompt Set

The prompt set is the most strategically important component of the system and the one most teams underinvest in. The queries you run determine what the system can and cannot detect. A poorly designed prompt set produces data that looks like measurement but does not capture the citation behavior that actually matters for your business.

A well-structured prompt set for a B2B SaaS brand in a competitive category has five query layers.

Layer 1: Head-term category queries. These are the broadest questions a buyer might ask early in a research process. Examples: What are the best project management tools for engineering teams? or Which CRM is recommended for enterprise sales teams? Head-term queries establish your baseline category visibility and are the most comparable across engines. Aim for 8 to 12 of these per category.

Layer 2: Comparison and alternatives queries. These target the switching intent that produces the highest-converting AI-referred traffic. Examples: What are the best alternatives to Jira for startups? or How does HubSpot compare to Salesforce for mid-market companies? These queries are where comparison-page investments show up in citation data. Aim for 10 to 15 per category, including both your own brand name in comparisons and the top two or three competitor names.

Layer 3: Feature and use-case queries. These are the specific-functionality questions that buyers ask when they are in active evaluation. Examples: Which project management tools support sprint planning with Jira integration? or What CRM tools have the best email sequence automation? Feature queries test whether your documentation and product pages are informing model responses about your capabilities. Aim for 12 to 20 per category.

Layer 4: Brand-direct queries. These test what the AI engines say about your brand specifically. Examples: What does [Brand] do? or Who uses [Brand] and what do they say about it? Brand-direct queries detect citation accuracy problems — cases where the model's information about your product is wrong or outdated. Aim for 5 to 8 of these.

Layer 5: Competitor-frame queries. These test whether your brand appears in responses to queries framed around your competitors. Examples: What do [Competitor] users complain about? or Is there a better option than [Competitor] for [specific use case]? Competitor-frame queries measure whether your comparison-page and alternatives-content investments are working. Aim for 8 to 12 per category.

A full prompt set for a single competitive category runs 43 to 67 queries. Running this set weekly across five engines produces 215 to 335 queries per week (one per prompt per engine), which is well within the rate limits and budget of any standard API tier.
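The totals above follow directly from the per-layer ranges. A minimal sketch of that arithmetic, with layer keys and engine labels chosen here purely for illustration:

```python
# Query-count ranges per layer, taken from the five layer descriptions above.
# The dictionary keys and engine labels are illustrative names, not a standard.
LAYER_RANGES = {
    "head_term": (8, 12),
    "comparison": (10, 15),
    "feature_use_case": (12, 20),
    "brand_direct": (5, 8),
    "competitor_frame": (8, 12),
}
ENGINES = ["chatgpt", "perplexity", "claude", "gemini", "copilot"]


def prompt_set_size(layer_ranges):
    """Return (min_total, max_total) queries for one category."""
    low = sum(lo for lo, _ in layer_ranges.values())
    high = sum(hi for _, hi in layer_ranges.values())
    return low, high


def weekly_queries(layer_ranges, engines):
    """Each prompt is sent once per engine per weekly run."""
    low, high = prompt_set_size(layer_ranges)
    return low * len(engines), high * len(engines)


print(prompt_set_size(LAYER_RANGES))          # (43, 67)
print(weekly_queries(LAYER_RANGES, ENGINES))  # (215, 335)
```

Keeping the ranges in one structure makes it easy to see how adding a category or an engine changes the weekly volume.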

The Data Collection Architecture

With the prompt set designed, the collection architecture has three layers: the runner, the parser, and the store.

The runner. The runner is the component that submits each prompt to each engine and retrieves the response. For API-based engines, this is straightforward HTTP client code. A basic Python implementation using the `openai`, `anthropic`, and `google-generativeai` libraries can run a full 50-query set across four engines in under three minutes on a standard server. For Perplexity, use the beta API endpoint with the `llama-3-sonar-large-32k-online` model or equivalent current model. For Copilot, use Playwright with a headless Chromium instance authenticated via a Microsoft personal account.

Key runner design decisions:

1. Run each engine in parallel, not sequentially. Sequential execution of a 50-query set across 5 engines takes 15 to 25 minutes. Parallel execution takes 3 to 5 minutes. Use Python's `asyncio` or a job queue like Celery for parallelism.

2. Log the raw response text, not just the parsed result. Storage is cheap. The ability to re-parse historical responses with improved entity detection logic is valuable. Always store the full response text, not just the extracted mention flag.

3. Capture model version metadata. ChatGPT's GPT-4o, GPT-4 Turbo, and GPT-3.5 produce meaningfully different citation patterns. Log the model version with every response so you can distinguish version-driven changes from content-driven changes.

4. Add jitter to API calls. Running all queries simultaneously can trigger rate limiting on engines with strict per-minute limits. Adding 1 to 3 seconds of random delay between calls within a batch prevents this.
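The four design decisions above can be sketched with `asyncio` alone. In this sketch `call_engine` is a hypothetical placeholder that returns canned text; in a real runner it would wrap each vendor's SDK call. The `RawResponse` fields are assumptions loosely modeled on the raw response table described later.

```python
import asyncio
import random
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class RawResponse:
    engine: str
    model_version: str      # decision 3: log the model version
    query: str
    response_text: str      # decision 2: keep the full raw text
    response_timestamp: str


async def call_engine(engine: str, query: str) -> tuple[str, str]:
    """Hypothetical stand-in for a real API call; returns (model_version, text)."""
    await asyncio.sleep(0)  # stands in for network latency
    return f"{engine}-stub-model", f"[{engine}] stub answer to: {query}"


async def run_engine(engine, queries, jitter=(1.0, 3.0)):
    """Send one engine's queries in order, with a random pause between calls."""
    results = []
    for i, query in enumerate(queries):
        if i:  # decision 4: jitter between calls, none before the first
            await asyncio.sleep(random.uniform(*jitter))
        model_version, text = await call_engine(engine, query)
        stamp = datetime.now(timezone.utc).isoformat()
        results.append(RawResponse(engine, model_version, query, text, stamp))
    return results


async def run_all(engines, queries, jitter=(1.0, 3.0)):
    """Decision 1: engines run concurrently rather than one after another."""
    batches = await asyncio.gather(*(run_engine(e, queries, jitter) for e in engines))
    return [r for batch in batches for r in batch]


# Zero jitter here only so the demo finishes instantly.
responses = asyncio.run(
    run_all(["chatgpt", "claude"], ["Which CRM suits enterprise sales?"], jitter=(0, 0))
)
print(len(responses), responses[0].model_version)
```

Running engines concurrently while spacing calls within each engine keeps per-engine request rates low without giving up the wall-clock benefit of parallelism.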

The parser. The parser reads each raw response and produces structured citation data. The minimum output is a boolean citation flag — was the target brand mentioned, yes or no? More useful output includes citation position (first mention, second mention, not mentioned), citation context (was the brand recommended positively, neutrally, or negatively), and competitor co-citations (which other brands appeared in the same response).

Building a reliable parser requires handling the real-world variability in how AI engines refer to brands:

Direct name mention: Linear is a popular choice
Possessive: Linear's project management approach
URL reference: linear.app
Abbreviated: LIN (rare but occurs in some responses)
Paraphrased description: the modern issue tracker that Vercel and Loom use (no name but identifiable)

For most teams, a regex-based parser that catches direct name mentions, possessives, and URL references covers 90 to 95% of citation events. The edge cases (paraphrased descriptions) are important for large-scale programs but can be addressed in a later iteration using a small classification model.
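A minimal sketch of such a regex-based parser follows. It matches the name case-sensitively (to avoid counting the ordinary word "linear") and the domain case-insensitively; that choice, the `ExampleTracker` brand, and the sample text are illustrative assumptions. Citation position is approximated here as the order in which brands first appear.

```python
import re


def build_brand_pattern(name: str, domain: str) -> re.Pattern:
    """Match the brand name (optionally possessive) or its domain."""
    name_part = rf"\b{re.escape(name)}(?:['’]s)?\b"
    # Scoped (?i:...) makes only the domain match case-insensitive.
    domain_part = rf"(?i:\b{re.escape(domain)}\b)"
    return re.compile(f"{name_part}|{domain_part}")


def citation_positions(text: str, brands: dict) -> dict:
    """Rank brands by first appearance (1 = first); None if not mentioned."""
    first_hit = {}
    for name, domain in brands.items():
        match = build_brand_pattern(name, domain).search(text)
        if match:
            first_hit[name] = match.start()
    ordered = sorted(first_hit, key=first_hit.get)
    return {
        name: (ordered.index(name) + 1 if name in first_hit else None)
        for name in brands
    }


# Illustrative brand list; ExampleTracker is a made-up competitor.
BRANDS = {
    "Linear": "linear.app",
    "Jira": "atlassian.com",
    "ExampleTracker": "example.com",
}
sample = "Jira remains common, but Linear's speed wins fans (linear.app)."
print(citation_positions(sample, BRANDS))
# {'Linear': 2, 'Jira': 1, 'ExampleTracker': None}
```

Abbreviations and paraphrased descriptions are deliberately out of scope here, in line with the later-iteration approach described above.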

The store. The storage layer has two tables: a raw response table and an aggregated metrics table.

The raw response table has columns: `response_id` (UUID), `query_id` (FK to prompt set), `engine`, `model_version`, `response_text` (full text), `response_timestamp`, `run_id`.

The aggregated metrics table has columns: `metric_id` (UUID), `response_id` (FK), `brand`, `cited` (boolean), `citation_position` (integer, null if not cited), `citation_sentiment` (positive/neutral/negative/null), `competitor_co_citations` (array), `parsed_timestamp`.

PostgreSQL with JSONB for the response text and a standard relational schema for metrics works well at the scale of most AEO programs. For teams running larger programs (1,000+ queries per week), BigQuery or Snowflake handles the analytical query load better.
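The two-table layout can be tried out locally before committing to a database. The sketch below uses Python's built-in `sqlite3` as a stand-in, so the types are approximations: in PostgreSQL the ID columns would be UUID, the co-citation column an array or JSONB, and the timestamps `timestamptz`. The table name `citation_metrics` and the helper functions are illustrative.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# SQLite stand-in for the PostgreSQL schema described above.
SCHEMA = """
CREATE TABLE raw_responses (
    response_id        TEXT PRIMARY KEY,
    query_id           TEXT NOT NULL,
    engine             TEXT NOT NULL,
    model_version      TEXT NOT NULL,
    response_text      TEXT NOT NULL,
    response_timestamp TEXT NOT NULL,
    run_id             TEXT NOT NULL
);
CREATE TABLE citation_metrics (
    metric_id               TEXT PRIMARY KEY,
    response_id             TEXT NOT NULL REFERENCES raw_responses(response_id),
    brand                   TEXT NOT NULL,
    cited                   INTEGER NOT NULL,
    citation_position       INTEGER,
    citation_sentiment      TEXT CHECK (citation_sentiment IN ('positive', 'neutral', 'negative')),
    competitor_co_citations TEXT NOT NULL,
    parsed_timestamp        TEXT NOT NULL
);
"""


def utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


def store_response(conn, query_id, engine, model_version, text, run_id) -> str:
    """Insert the full raw response and return its generated ID."""
    response_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO raw_responses VALUES (?, ?, ?, ?, ?, ?, ?)",
        (response_id, query_id, engine, model_version, text, utc_now(), run_id),
    )
    return response_id


def store_metric(conn, response_id, brand, cited, position=None,
                 sentiment=None, co_citations=()):
    """Insert one parsed result; co-citations are stored as a JSON list."""
    conn.execute(
        "INSERT INTO citation_metrics VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), response_id, brand, int(cited), position,
         sentiment, json.dumps(list(co_citations)), utc_now()),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
rid = store_response(conn, "q-001", "perplexity", "stub-model",
                     "Linear is a popular choice", "run-1")
store_metric(conn, rid, "Linear", True, 1, "positive", ["Jira"])
```

Because the raw text lives in its own table, the metrics rows can be deleted and regenerated whenever the parser improves.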

Building the Visualization Layer

Raw citation data is not actionable without visualization. The three views that matter most:

Share-of-citation over time by engine. A line chart showing, for each engine, the percentage of queries in your prompt set that cited your brand. This view answers the fundamental question: is our citation share going up or down, and on which engines? Plot this weekly. Overlay model update events as vertical markers so you can separate content-driven changes from model-driven changes.

Competitive citation matrix. A heatmap or table showing, for each query in your prompt set, which brands were cited across each engine. This view reveals where competitors have citation advantages you do not have, and which query types are your weakest. A query where three competitors appear consistently and you do not is a direct brief for a content investment.

Citation accuracy tracker. For brand-direct queries, a checklist of factual claims the AI engines make about your brand, manually verified against your actual product. This view catches accuracy problems — wrong pricing, deprecated features, incorrect use-case associations — that content updates can fix. At least one company (a mid-market CRM vendor) discovered in early 2026 that ChatGPT was consistently citing their legacy pricing plan, which had been discontinued 14 months earlier. The fix was a documentation update and a product page revision. Citation accuracy on that dimension corrected within six weeks.
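The first two views reduce to simple aggregations over parsed results. A minimal sketch, assuming each row is a `(week, engine, query_id, brand, cited)` tuple; the row shape and sample values are illustrative, and in practice these would be SQL queries feeding the dashboard:

```python
from collections import defaultdict


def share_of_citation(rows, brand):
    """Return {(week, engine): percent of queries in which `brand` was cited}."""
    asked = defaultdict(set)
    cited = defaultdict(set)
    for week, engine, query_id, row_brand, was_cited in rows:
        asked[(week, engine)].add(query_id)
        if row_brand == brand and was_cited:
            cited[(week, engine)].add(query_id)
    return {key: 100.0 * len(cited[key]) / len(asked[key]) for key in asked}


def citation_matrix(rows, week):
    """Return {query_id: {engine: [brands cited]}} for one week."""
    matrix = defaultdict(lambda: defaultdict(list))
    for row_week, engine, query_id, brand, was_cited in rows:
        if row_week == week and was_cited:
            matrix[query_id][engine].append(brand)
    return {q: {e: sorted(b) for e, b in by_engine.items()}
            for q, by_engine in matrix.items()}


# Made-up sample rows for illustration only.
ROWS = [
    ("2026-W20", "chatgpt", "q1", "Linear", True),
    ("2026-W20", "chatgpt", "q2", "Linear", False),
    ("2026-W20", "chatgpt", "q2", "Jira", True),
    ("2026-W20", "perplexity", "q1", "Linear", True),
    ("2026-W20", "perplexity", "q2", "Linear", True),
]
print(share_of_citation(ROWS, "Linear"))
print(citation_matrix(ROWS, "2026-W20"))
```

In the sample, `q2` on ChatGPT cites a competitor but not the tracked brand, which is exactly the kind of cell the matrix view is meant to surface.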

For tooling, Metabase is the fastest path to a working dashboard on top of PostgreSQL — it handles time-series charts, heatmaps, and table views without custom front-end development. Teams with more visualization ambition use Looker Studio (free, integrates with BigQuery), Grafana (better for real-time monitoring needs), or a custom React front-end with Recharts or Victory.

Alerting and Trend Detection

A dashboard that requires manual review to detect problems is a dashboard that will be ignored. The tracking system needs automated alerting for three conditions.

Citation share drop alert. If your share-of-citation on any engine drops more than 15 percentage points in a single week, trigger an alert. This threshold is calibrated to distinguish noise from signal — week-to-week variance in a 50-query prompt set runs 3 to 8 percentage points, so 15 points is a meaningful departure. Send the alert to Slack with the affected engine, the previous week's share, the current week's share, and a link to the dashboard.

Citation accuracy degradation alert. If a brand-direct query that previously produced accurate responses now produces an inaccurate claim, flag it immediately. This can be detected by hashing the key claim text from previous responses and comparing to current responses — any change in a factual claim field should trigger a review. This alert is lower volume but high priority because inaccurate AI claims about your product reach prospects who may never visit your site to check.

Competitor citation surge alert. If a competitor's citation share on your tracked queries increases more than 20 percentage points week-over-week, trigger an alert. This signals that a competitor has shipped content that is being picked up by the models. Review which specific queries drove the change — that is the editorial brief for your response.
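The claim-hash comparison described for the accuracy degradation alert can be sketched as follows. It assumes a prior step has already extracted key claims into named fields (for example `pricing`); that extraction step, the field names, and the sample claims are hypothetical. Hashing after light normalization means whitespace and casing changes do not trigger a review, while any change in wording does.

```python
import hashlib
import re


def normalize_claim(text: str) -> str:
    """Collapse whitespace and lowercase so trivial edits do not change the hash."""
    return re.sub(r"\s+", " ", text).strip().lower()


def claim_fingerprint(claims: dict) -> dict:
    """Map each claim field to a SHA-256 digest of its normalized text."""
    return {
        field: hashlib.sha256(normalize_claim(value).encode("utf-8")).hexdigest()
        for field, value in claims.items()
    }


def changed_fields(previous: dict, current: dict) -> list:
    """Return the claim fields whose text differs between two runs."""
    prev, cur = claim_fingerprint(previous), claim_fingerprint(current)
    return sorted(f for f in prev.keys() | cur.keys() if prev.get(f) != cur.get(f))


# Made-up claims for illustration only.
last_week = {"pricing": "Starts at $10 per user per month", "category": "Issue tracker"}
this_week = {"pricing": "Starts at  $8 per user per month", "category": "issue tracker "}
print(changed_fields(last_week, this_week))  # ['pricing']
```

A changed hash only says the claim text moved; a person still has to decide whether the new claim is accurate.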

For notification infrastructure, a simple Python script that queries the aggregated metrics table and posts to a Slack webhook covers all three alert types. PagerDuty or OpsGenie integration is overkill for most AEO programs.
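A minimal sketch of such a script, covering the share-drop and competitor-surge thresholds. It assumes weekly shares have already been computed as `{(brand, engine): percent}` dictionaries; that shape, the sample numbers, and the dashboard URL are illustrative. The Slack call posts a JSON body with a `text` field to an incoming webhook URL, which you would supply.

```python
import json
import urllib.request

DROP_THRESHOLD = 15.0   # own-brand drop, in percentage points
SURGE_THRESHOLD = 20.0  # competitor gain, in percentage points


def detect_alerts(previous, current, own_brand):
    """Compare two consecutive weeks of {(brand, engine): share_percent}."""
    alerts = []
    for (brand, engine), now_share in current.items():
        before = previous.get((brand, engine))
        if before is None:
            continue  # no baseline for this pair yet
        delta = now_share - before
        if brand == own_brand and delta < -DROP_THRESHOLD:
            alerts.append(("share_drop", brand, engine, before, now_share))
        elif brand != own_brand and delta > SURGE_THRESHOLD:
            alerts.append(("competitor_surge", brand, engine, before, now_share))
    return alerts


def slack_payload(alert, dashboard_url):
    """Build the message body: engine, previous share, current share, link."""
    kind, brand, engine, before, now_share = alert
    text = f"[{kind}] {brand} on {engine}: {before:.0f}% -> {now_share:.0f}% ({dashboard_url})"
    return {"text": text}


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook; returns the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Made-up shares for illustration only.
previous = {("Linear", "chatgpt"): 60.0, ("Jira", "chatgpt"): 30.0, ("Linear", "claude"): 40.0}
current = {("Linear", "chatgpt"): 40.0, ("Jira", "chatgpt"): 55.0, ("Linear", "claude"): 35.0}
for alert in detect_alerts(previous, current, own_brand="Linear"):
    print(slack_payload(alert, "https://dashboard.example/aeo")["text"])
```

The demo only prints the messages; calling `post_to_slack` with a real webhook URL is the one step left out.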

ChatGPT API vs Web Scraping: The Trade-Off

One of the first architecture decisions teams face is whether to use the official OpenAI API or to scrape ChatGPT via browser automation. The trade-offs are real on both sides.

The API gives you clean, reproducible, cheap responses. A GPT-4o API call costs roughly $0.005 per query at current pricing — a 400-query weekly run costs about $2. Response time is fast and rate limits are generous on the standard tier. The catch is that API responses do not include real-time web browsing by default. The API reflects the model's training-data knowledge, not live web content. For measuring brand citation in training data, this is appropriate and sufficient. For measuring real-time citation behavior — the kind that reflects your most recent content investments — you need either web search tool calls (available in the API) or browser-based testing.

Web scraping gives you live-browsing responses but is fragile. ChatGPT's web interface includes browsing, but scraping it via Playwright is subject to rate limiting, UI changes, and potential terms-of-service violations. Most teams that start with web scraping migrate to the API within 90 days when the maintenance burden becomes clear.

The practical answer is both, for different use cases. Use the API for high-frequency automated tracking (daily or weekly citation share measurement) where reproducibility and cost efficiency matter. Use manual web interface testing for spot checks and for verifying that your most recent content investments are being picked up by the live, browsing-enabled model. Document which data points came from which method so your analysis does not conflate the two.

This distinction matters more for some categories than others. Brands competing in rapidly evolving topics — AI tools, software platforms, financial products with current pricing — will see larger differences between API (training data) and web (live search) citation rates than brands in stable categories. For the share-of-model measurement framework, understanding which signal you are measuring is foundational.

Perplexity, Claude, and Gemini Tracking Specifics

Perplexity, Claude, and Gemini each have distinct behaviors that affect how you interpret their citation data.

Perplexity always browses. Every Perplexity response includes live web citations, and the engine surfaces inline sources with each factual claim. This makes Perplexity the best engine for measuring the citation impact of recent content investments — a piece of content published this week can appear in Perplexity citations within days of indexing. The Perplexity API returns the source URLs alongside the response text, which allows you to parse not just whether your brand was mentioned but whether your own domain was cited as a source. Owning the citation source is distinct from being mentioned — a brand can be mentioned in a Perplexity response while the citation source is a competitor's comparison page or a third-party review.
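The mention-versus-source distinction is easy to encode once the source URLs are in hand. This sketch assumes the URLs have already been pulled out of the API response into a plain list; the exact response field that carries them is not assumed here. The function names and sample URLs are illustrative.

```python
from urllib.parse import urlparse


def source_domains(source_urls):
    """Extract hostnames from source URLs, dropping a leading 'www.'."""
    domains = []
    for url in source_urls:
        host = urlparse(url).hostname or ""
        domains.append(host[4:] if host.startswith("www.") else host)
    return domains


def classify_citation(response_text, source_urls, brand, own_domain):
    """Separate 'brand was mentioned' from 'our own domain was a cited source'."""
    mentioned = brand.lower() in response_text.lower()
    owns_source = any(
        domain == own_domain or domain.endswith("." + own_domain)
        for domain in source_domains(source_urls)
    )
    return {"mentioned": mentioned, "owns_source": owns_source}


# Mentioned, but the cited source is a third-party review site.
print(classify_citation(
    "Linear is a popular choice for engineering teams.",
    ["https://www.reviews.example.com/issue-trackers"],
    brand="Linear",
    own_domain="linear.app",
))
# {'mentioned': True, 'owns_source': False}
```

Tracking both flags per response shows how often a mention is being driven by someone else's page, which is the case a new comparison or documentation page is meant to change.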

Claude is conservative about brand recommendations. Claude 3.5 and Claude 4 are notably more cautious than ChatGPT or Perplexity about naming specific brands in response to category queries. In our category testing, Claude produced a direct brand recommendation in 61% of category queries, whereas ChatGPT produced one in 84% of the same queries. This means Claude citation rates will be structurally lower than ChatGPT rates for the same brand, and the gap is not necessarily a problem. The more useful Claude signal is entity association: does Claude describe your brand's value proposition accurately when directly queried? Does it associate your brand with the category and use case you want to own? These entity-association signals are leading indicators of long-term citation authority across all engines, because they reflect the model's underlying knowledge state.

Gemini is the most volatile. Citation patterns in Gemini vary more across model versions (Gemini 1.0, 1.5, 2.0) than any other engine. When tracking Gemini, always log the model version and treat model updates as potential discontinuities in your trend line. Gemini's integration with Google Search means its live-browsing citation patterns are closely correlated with Google organic rankings — brands that rank well in Google Search for a category tend to be cited well in Gemini. This makes Gemini citation share partly a proxy for organic SEO health, and it means the remediation for poor Gemini citation often runs through AI Overviews and standard SEO signals rather than AEO-specific content investments.

Staffing the Dashboard: Team Workflows

The technical architecture is only half the build. The other half is designing the team workflow that turns citation data into content decisions. A dashboard that produces data nobody acts on is infrastructure waste.

The workflow that works has three cadences.

Weekly: citation rate review. Assign one person (typically the AEO lead or a senior content strategist) to review the weekly share-of-citation report. The review should take 20 to 30 minutes and produce two outputs: a list of queries where citation share declined (potential content problems), and a list of queries where competitors gained share (competitive content opportunities). These outputs feed directly into the content planning process.

Monthly: deep diagnostic. Once a month, run a more detailed audit of citation accuracy, competitive positioning, and prompt-set coverage. Review whether your prompt set still reflects the actual queries buyers are asking — query patterns evolve as categories mature and as new use cases emerge. Update the prompt set quarterly at minimum. Review citation accuracy on brand-direct queries and triage any inaccuracies for content fixes. Compare your citation share trajectory across engines to identify which engines need the most attention.

Quarterly: strategy review. Present citation share trends to marketing leadership alongside the AEO metrics that belong in a board deck. The citation dashboard feeds into the share-of-model metric that CMOs are increasingly reporting to boards. Quarter-over-quarter trend data is more defensible than point-in-time snapshots, which is why building the tracking infrastructure early — even before the data is actionable — compounds in value over time.

The playbook for turning dashboard data into content action runs as follows.

1. Identify the citation gap. Pull the competitive citation matrix for the query types where your brand citation share is lowest. Rank by query volume (estimated from third-party keyword tools) times citation gap (competitor share minus your share). The highest-ranked items are your highest-priority content investments.

2. Audit the content already covering that query. Do you have a page that should be cited for this query type? If yes, why is it not being cited? Common reasons: the page renders JavaScript-only (invisible to crawlers), the page is gated, the page lacks clear extractable answers, or the page is too thin. Fix the highest-value existing pages before creating new ones.

3. Brief and build the missing content. For query types where no existing content is a candidate for citation, brief the specific page needed. Format should match the citation pattern for that query type — comparison queries need comparison pages, feature queries need documentation, category queries need authoritative opinion content.

4. Measure the impact. After publishing, track the citation rate on the affected queries for 6 to 10 weeks. Model update latency means new content can take 4 to 8 weeks to affect citation rates in training-data-based engines like Claude and ChatGPT without browsing. Perplexity will reflect the change within days of indexing. The ChatGPT citation engineering framework covers how to accelerate the training-data uptake cycle.
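The ranking in step 1 is a single multiplication per query. A minimal sketch, with shares expressed in percentage points; the field names, the sample figures, and the choice to treat a negative gap (where you already lead) as zero are assumptions made for illustration:

```python
def rank_citation_gaps(queries):
    """Score each query as estimated volume x (competitor share - own share)."""
    scored = []
    for q in queries:
        # A negative gap means you already lead; treat it as no gap.
        gap = max(q["competitor_share"] - q["own_share"], 0)
        scored.append({**q, "gap": gap, "priority": q["est_volume"] * gap})
    return sorted(scored, key=lambda q: q["priority"], reverse=True)


# Made-up numbers for illustration only.
QUERIES = [
    {"query": "alternatives to Jira for startups", "est_volume": 5000,
     "own_share": 40, "competitor_share": 45},
    {"query": "sprint planning tools", "est_volume": 1000,
     "own_share": 10, "competitor_share": 60},
    {"query": "what does the brand do", "est_volume": 800,
     "own_share": 70, "competitor_share": 50},
]
for row in rank_citation_gaps(QUERIES):
    print(row["priority"], row["query"])
```

In the sample, the lower-volume query ranks first because its 50-point gap outweighs the 5-point gap on the higher-volume query.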

The Two-Engineer MVP Build Plan

For a team ready to build the minimum viable citation dashboard, here is a realistic sprint plan.

Sprint 1 (Week 1-2): Data collection foundation.
- Set up API keys for OpenAI, Anthropic, Perplexity, and Google AI Studio
- Build the Python runner with async execution across all four API-based engines
- Build the PostgreSQL schema (raw response table + aggregated metrics table)
- Write the brand-mention parser with direct-name, URL-reference, and possessive detection
- Run the first weekly batch manually and verify output

Sprint 2 (Week 3-4): Automation and visualization.
- Schedule the weekly runner via cron or a job queue
- Set up the Metabase (or Looker Studio) dashboard with the three core views
- Build the three alert types in Slack webhook format
- Add model-version logging and run-metadata capture

Sprint 3 (Week 5-6): Prompt set refinement and Copilot.
- Refine the prompt set based on the first four weeks of data: add query types that are producing high-variance results, remove query types that are not differentiating
- Add Playwright-based Copilot simulation for the browsing-enabled Microsoft responses
- Document the full system for handoff and maintenance

Total engineering investment: 6 to 8 weeks at one to two engineers, part-time. Ongoing maintenance is 2 to 4 hours per week. The marginal cost of running the system — primarily API fees — is $20 to $60 per month for a standard 400-query weekly set across four API-based engines.

What Commercial AEO Tools Measure — and What They Miss

Several commercial tools now offer AI citation tracking as a product, including Profound, Otterly, and the AEO features in the updated Ahrefs and Semrush platforms. Understanding what these tools cover — and where they fall short — determines whether a custom build is necessary or whether a commercial tool can meet your needs.

Commercial tools generally cover ChatGPT citation tracking well, because OpenAI's API makes this tractable at scale. Profound's core product runs prompt sets against ChatGPT and provides share-of-model reporting that is directly comparable to the architecture described here. For teams whose citation intelligence needs are primarily ChatGPT-focused, Profound is a reasonable starting point with faster time-to-value than a custom build.

The gaps in commercial tools, as of May 2026:
- No commercial tool provides full, production-grade Perplexity tracking with source-URL parsing
- Gemini tracking in commercial tools is generally limited to a subset of models and does not log model version metadata
- None provide Claude citation tracking with entity-association analysis
- None provide Copilot tracking at all
- Custom prompt set design is limited: most tools run their own standardized prompt libraries, not prompts calibrated to your specific competitive category
- The raw response text is not accessible for export or retrospective re-parsing

For most B2B brands in competitive categories, the commercial tools are a useful starting point for the first 3 to 6 months of a measurement program. Teams that discover significant citation gaps — or that are competing in categories where the commercial tool's prompt library does not adequately cover their queries — benefit from the custom architecture described here. The two are not mutually exclusive: many teams use a commercial tool for ChatGPT coverage and a custom build for the other engines.

From Dashboard to Decision: The Citation Intelligence Flywheel

The dashboard is not the destination. The destination is a content investment strategy that is continuously calibrated by citation data — a feedback loop where measurement drives creation, creation changes citation rates, and changed citation rates inform the next round of measurement.

Teams that build this flywheel gain a compounding advantage over teams that do not. A brand with 18 months of weekly citation data knows exactly which query types it owns, which it contests, and which are controlled by competitors. It knows how long its content investments take to move the needle on each engine. It can prioritize its editorial budget against a citation gap map rather than a keyword volume estimate. That precision makes every content dollar more effective than it would be without the data.

The brands that will dominate AI search in 2028 are largely determined by the citation share they accumulate in 2026 and 2027. The models that will be widely deployed in 2028 are being trained on content published now. The content that will be in that training data is the content that is getting cited by the current generation of models, because high-citation content tends to be high-authority content — the kind that gets republished, linked, and referenced in the documents that end up in training sets.

Building the measurement infrastructure is the prerequisite for everything that follows. The AEO citation tracking playbook covers the strategic framework. This article has covered the technical build. The next step is starting Sprint 1.

Takeaway: Multi-engine AI citation tracking is a 6 to 8 week engineering project that costs $20 to $60 per month to run and produces measurement data that no commercial analytics tool provides comprehensively. The architecture is a prompt runner, a brand-mention parser, a time-series database, and a visualization layer — straightforward components that a one to two person engineering team can assemble from documented, open-source parts. The teams that build it in 2026 will have 12 to 18 months of citation trend data by the time competitors begin measuring — and in a game where share-of-model is the leading indicator of pipeline, that measurement advantage compounds into a durable competitive lead.

Frequently Asked Questions

How do you track AI citation rates across ChatGPT, Perplexity, and Claude simultaneously?

Tracking citation rates across multiple AI engines requires a purpose-built architecture because no single analytics platform reads all five major engines. The core approach is a prompt-runner layer that submits a standardized set of queries to each engine's API (or a controlled scraping layer where APIs are unavailable), logs the full text responses, and passes each response through a brand-mention parser that detects your target entity and competitors. ChatGPT and Claude offer official APIs that make automated querying straightforward. Perplexity offers an API in beta. Gemini is available via Google's AI Studio API. Microsoft Copilot requires web-level simulation because it has no public API. Each engine response is stored in a time-series database alongside query metadata, engine version, and response timestamp. Aggregating across engines requires normalizing brand mention strings (accounting for abbreviations, misspellings, and synonym references) before rolling up into a unified share-of-citation metric. Most teams run this at daily or weekly frequency to build trend data.

What is the best data architecture for storing and comparing AI search citation data?

The most practical data architecture for multi-engine citation tracking combines a document store for raw responses with a relational layer for aggregated metrics. Raw API responses — the full text of each AI answer — should be stored in a document database such as PostgreSQL JSONB, MongoDB, or BigQuery JSON columns. This preserves the full text for retrospective analysis as your parsing logic improves. Aggregated citation scores (brand mentioned: yes/no, brand position in response, competitor mentions) are stored in a normalized relational structure with columns for query ID, engine, date, brand, and binary or positional citation flag. A time-series dimension is essential: citation rates move over time as models are updated, and you need at least 90 days of baseline data to detect statistically meaningful trends. Many teams layer Metabase, Looker, or a custom React dashboard on top of this structure. The schema should be designed from day one to support multi-engine comparisons — a separate row per engine per query per date is the most flexible unit of analysis.

How large a prompt set is needed for statistically meaningful AEO tracking?

The minimum viable prompt set for statistically meaningful AEO tracking is 50 queries per category, run weekly. At that volume you have enough data to detect citation share changes of 10 percentage points or greater with reasonable confidence. For a B2B SaaS company competing in a specific category, 50 prompts cover the main head-term query, 10 to 15 comparison and alternatives queries, 15 to 20 use-case or feature queries, and 10 to 15 competitor-name queries where you want to appear. Larger programs targeting 5 or more categories should aim for 200 to 400 total prompts per weekly run. The prompt set design matters as much as the volume: prompts need to vary in phrasing, specificity, and intent to avoid overfitting to a narrow query type. A single query phrased five different ways produces more useful signal than five unrelated queries on the same topic. Teams that run fewer than 30 queries typically see too much variance week-to-week to distinguish real trend from noise.
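One rough way to see why prompt-set size matters is the standard error of a proportion. The sketch below treats each query as an independent yes/no trial, which is a simplifying assumption: real prompts overlap in topic and are rerun every week, so this is a back-of-envelope guide rather than a formal power analysis.

```python
import math


def share_standard_error(share: float, n_queries: int) -> float:
    """Standard error of a citation share, in percentage points.

    Uses the binomial approximation sqrt(p * (1 - p) / n), assuming
    independent queries.
    """
    return 100.0 * math.sqrt(share * (1.0 - share) / n_queries)


# Worst case is a 50% share, where the variance is largest.
for n in (30, 50, 200):
    print(n, round(share_standard_error(0.5, n), 1))
# 30 9.1
# 50 7.1
# 200 3.5
```

Under these assumptions a 50-query set carries roughly 7 points of sampling noise at a 50% share, which is why single-week moves smaller than about 10 points are hard to read, and why 30 queries or fewer is noisier still.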

Can you use the ChatGPT and Claude APIs to measure AEO automatically?

Yes, both the ChatGPT (OpenAI) and Claude (Anthropic) APIs support automated querying for AEO measurement, with important caveats. The OpenAI API gives you access to GPT-4o and GPT-4 Turbo responses, but API responses do not include browsing or real-time web search by default: they reflect the model's training data, not live web citations. For measuring training-data citation presence, this is fine. For measuring real-time Perplexity-style citation behavior, you need the ChatGPT web interface or the API with the web search tool enabled. Claude's API via Anthropic is similarly straightforward for training-data citation measurement. Rate limits are the main operational constraint: at 200 to 400 queries per week per API-based engine, you will stay well within standard API tier limits. Budget is modest. At roughly $0.005 per GPT-4o query, a 400-query weekly run costs about $2 on OpenAI, and the full set across the four API-based engines runs roughly $20 to $60 per month depending on response length. The larger cost is engineering time for the parsing and storage layer, not API fees.

What is the minimum viable AEO tracking setup for a team with limited engineering resources?

The minimum viable AEO tracking setup for a resource-constrained team is a spreadsheet-driven manual process supplemented by one lightweight automation. Start with a Google Sheet with columns for query text, engine, date, brand mentioned (yes/no), brand position (first/second/third/not mentioned), and notes. Run 20 to 30 queries manually across two or three engines each week, recording results by hand. This gives you a real-time baseline with zero engineering cost. Once you have 4 to 6 weeks of baseline data and can justify the investment, add a single Python script that automates the OpenAI and Claude API calls and appends results to the sheet via the Google Sheets API. This takes roughly 8 to 12 hours of engineering time to build and reduces the weekly manual work by 60 to 70 percent. The full custom dashboard with multi-engine automation, a time-series database, and visualization layer is a 6 to 8 week engineering project, worthwhile for teams tracking 3 or more categories or competing in high-stakes categories where citation share is a primary growth lever.