AEO Budget Benchmark 2026: 11% of Marketing Spend, Climbing Fast

Causal Impact, GeoLift, and ZIP-level holdouts give marketing leaders the first defensible answer to whether AEO investment actually moves revenue.

By Tessa Wright, Enterprise & Revenue · May 26, 2026 · 17 min read

When Booking.com's marketing science team published their geo-experimentation framework in 2023, the industry got a glimpse of how a $20B revenue business proves marketing causality without user-level randomization. The methodology they described — matched-market geo holdouts analyzed with Bayesian structural time-series — has since become the default measurement layer for any channel that cannot be A/B tested at the click level. AEO is that channel. You cannot randomize which users see your citation in ChatGPT. You cannot cookie a Perplexity answer. The only credible way to prove answer-engine optimization moves revenue is a ZIP-code or DMA geo experiment, and 2026 is the year operators finally have the open-source and SaaS tooling to run one without a data-science PhD.

This piece walks through the methodology our team has run across nine AEO geo-experiments for B2B SaaS, retail, and multi-location brands since Q3 2025. We cover the math, the tooling tradeoffs (Google Causal Impact, Meta GeoLift, Eppo, Statsig), the matched-market selection rubric, a worked example, and the playbook a CMO can hand to a marketing analyst on Monday morning.

Why AEO measurement broke the A/B testing playbook

The classic digital-marketing experiment splits users into treatment and control via a cookie, a logged-in user ID, or an ad-platform audience. That works when the channel respects user identity: paid social, paid search, email, on-site personalization. It collapses when the channel is an LLM answer.

ChatGPT does not know whether the user asking "best CRM for early-stage startups" was assigned to test or control. Perplexity does not honor your randomization scheme. Google's AI Overviews and Gemini grounding pull from a single global index — there is no per-user treatment slot to manipulate. The same fundamental problem broke TV-attribution measurement in the 1980s and broke organic-SEO measurement in the 2000s. The solution then is the solution now: hold out a geography.

A Nielsen marketing-mix-modeling primer reframed the case in 2022: geo experiments deliver causal estimates with less than half the data volume needed for user-level tests, and they work for any channel where geography is observable. AEO clears that bar. ChatGPT search, Perplexity, Google AI Overviews, Claude, and Gemini all factor user IP, account-declared location, and query-language signals into how they weight local results, and that geo-conditioning is what makes ZIP-code holdouts work.

The deeper measurement question is what counts as the "intervention." For AEO, the treatment is usually a bundle: new pillar content, schema upgrades, llms.txt publication, citation-pattern engineering on Reddit and YouTube, original-research data drops, or local schema and Google Business Profile work for multi-location brands. Geo experiments do not require you to isolate each component; they require you to apply the bundle in test markets and not in control markets, then attribute the aggregate lift. Component-level attribution is a separate problem — see our Multi-touch attribution deep-dive for that frame.

Causal Impact: the statistical engine

Google Research's CausalImpact R package, introduced in Kay Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven Scott's 2015 Annals of Applied Statistics paper, is the workhorse statistical method for this whole category. It uses Bayesian structural time-series (BSTS) to project a synthetic counterfactual for the test market based on pre-period correlation with control markets, then computes the posterior distribution of the difference between observed and projected outcomes.

In plain language: the model learns what the test ZIP code's revenue trend looked like before the intervention, finds control ZIPs whose pre-period trend mirrored the test ZIP's trend, then asks "given how the controls evolved during the intervention window, what would the test ZIP have done if untreated?" The gap is the causal estimate.
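
To make the counterfactual idea concrete, here is a minimal sketch that swaps BSTS for a plain least-squares fit of the test market on one control series. It is a simplified stand-in, not what CausalImpact does internally, and the function names and weekly numbers are made up for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def estimate_lift(test_pre, control_pre, test_post, control_post):
    """Relative lift of observed test outcomes over the projected counterfactual."""
    # Learn the pre-period relationship between the test and control markets.
    a, b = fit_line(control_pre, test_pre)
    # Project what the test market would have done, given how controls evolved.
    counterfactual = [a + b * c for c in control_post]
    observed, expected = sum(test_post), sum(counterfactual)
    return (observed - expected) / expected

# Hypothetical weekly demo requests (illustrative numbers only).
control_pre = [100, 110, 105, 120, 115, 125]
test_pre = [52, 57, 54.5, 62, 59.5, 64.5]      # tracks 0.5 * control + 2
control_post = [130, 128, 135, 140]
test_post = [73.7, 72.6, 76.45, 79.2]          # 10% above that relationship

lift = estimate_lift(test_pre, control_pre, test_post, control_post)
print(f"estimated lift: {lift:.1%}")
```

The gap between observed and projected is the estimate. What BSTS adds on top of this is trend, seasonality, and a full posterior rather than a single number.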

The BSTS specification matters because it handles the three things that wreck naive difference-in-differences:

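Seasonality. Hearing-aid leads spike in fall (Medicare open enrollment); SaaS demos spike in January (budget cycles); restaurant traffic spikes on Fridays. BSTS models these cycles with an explicit seasonal component once you declare the period (in CausalImpact, via the nseasons and season.duration model arguments).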
Trend. A growing or declining market in the pre-period extrapolates appropriately into the post-period.
Covariate shifts. The control series adjusts for shared shocks (a national news cycle, a holiday week, a competitor outage).

The package also outputs a clean point estimate (e.g., "+12.4% lift, 95% credible interval [+6.1%, +18.8%]") and a posterior probability that the effect is positive. CFOs love the probability statement because it answers their actual question: "How confident are you that this worked?"
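
For readers who want to see where those three numbers come from, the sketch below summarizes a set of posterior draws into a mean, a central 95% interval, and the share of draws above zero. The draws here are simulated with made-up parameters. In a real run they would come from the fitted model, and `summarize_posterior` is a hypothetical helper, not a CausalImpact function.

```python
import random

def percentile(sorted_values, q):
    """Linear-interpolation percentile for q in [0, 1] on a pre-sorted list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight

def summarize_posterior(draws, level=0.95):
    """Point estimate, central credible interval, and P(effect > 0) from draws."""
    ordered = sorted(draws)
    tail = (1 - level) / 2
    return {
        "mean": sum(ordered) / len(ordered),
        "interval": (percentile(ordered, tail), percentile(ordered, 1 - tail)),
        "prob_positive": sum(d > 0 for d in ordered) / len(ordered),
    }

# Hypothetical posterior draws of relative lift (simulated, not real results).
rng = random.Random(0)
draws = [rng.gauss(0.124, 0.032) for _ in range(10_000)]

summary = summarize_posterior(draws)
low, high = summary["interval"]
print(f"{summary['mean']:+.1%} lift, 95% interval [{low:+.1%}, {high:+.1%}], "
      f"P(lift > 0) = {summary['prob_positive']:.3f}")
```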

The CausalImpact Python port (causalimpact on PyPI) is feature-equivalent for most workloads. PyMC and tfp.sts give you more flexibility if you outgrow the canned package.

Meta GeoLift: open-source for power analysis and market selection

Meta's GeoLift R package (released by Meta's Marketing Science team in 2022) takes a different angle. Instead of running the BSTS model directly, GeoLift uses synthetic control methods (Abadie, Diamond, Hainmueller 2010) to construct a weighted combination of control markets that best matches the test market's pre-period trajectory, then runs the test against that synthetic control.

Where GeoLift shines is in the planning phase. Its power simulator runs Monte Carlo simulations across historical data to tell you, "If you treat 5 of your top-20 ZIPs for 6 weeks, you'll detect a true 10% lift with 80% power." That is the question every marketing analyst needs answered before the test starts, and it is the question Causal Impact does not directly answer.
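
The logic behind a power simulation can be sketched in a few lines. This is a simplified placebo-window version under our own assumptions, not GeoLift's algorithm: slide a fake treatment window across historical data, record the "lift" the estimator reports when nothing happened, then inject a known lift and count how often it clears the 95th percentile of that null distribution. The data and function names are hypothetical.

```python
import random

def window_lift(test, control, start, length, injected=0.0):
    """Lift of the test/control ratio inside a window versus all other weeks."""
    inside = range(start, start + length)
    t_in = sum(test[i] * (1 + injected) for i in inside)
    c_in = sum(control[i] for i in inside)
    t_out = sum(test) - sum(test[i] for i in inside)
    c_out = sum(control) - c_in
    return (t_in / c_in) / (t_out / c_out) - 1

def simulated_power(test, control, length, true_lift, alpha=0.05):
    """Share of placebo windows where an injected lift beats the null threshold."""
    starts = range(len(test) - length + 1)
    # Null distribution: estimated lift when no treatment was applied.
    null = sorted(window_lift(test, control, s, length) for s in starts)
    threshold = null[min(len(null) - 1, int((1 - alpha) * len(null)))]
    hits = sum(
        window_lift(test, control, s, length, true_lift) > threshold
        for s in starts
    )
    return hits / len(starts)

# Hypothetical 52 weeks of history for one test market and one control pool.
rng = random.Random(7)
control_hist = [1000 + rng.gauss(0, 50) for _ in range(52)]
test_hist = [0.5 * c * (1 + rng.gauss(0, 0.03)) for c in control_hist]

for lift in (0.02, 0.05, 0.10):
    power = simulated_power(test_hist, control_hist, length=6, true_lift=lift)
    print(f"true lift {lift:.0%}, 6-week window: power {power:.0%}")
```

If the printed power for your target lift sits well below 80%, that is the signal to add markets, lengthen the window, or change the metric before spending anything on treatment.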

The standard AEO-team workflow is to use GeoLift for design (market selection + power analysis), then run the actual inference in either GeoLift's analyze function or Causal Impact. Both deliver similar point estimates for clean experiments; GeoLift can have slightly tighter intervals on highly heterogeneous markets because synthetic control reweights rather than averages.

Eppo and Statsig: when to graduate to SaaS

Eppo's geo-experiment documentation and Statsig's geo-test guide describe what enterprise teams pay for: managed metric pipelines, automated matched-market selection across hundreds of geos, multi-team workflow with experiment registry, and audit-ready reporting that survives a board presentation.

The math under the hood is the same — BSTS, synthetic control, or augmented variants. What you pay for is workflow:

Eppo (raised a $40M Series B in 2023, per Eppo's announcement) ships pre-built integrations with Snowflake, BigQuery, Databricks, and Redshift, and runs experiment analysis on top of your warehouse. Pricing ranges from roughly $1,500/mo for startups to mid-six-figures for enterprises. They publish geo-experiment recipes that an analyst can clone in a half-day.
Statsig (which caught the attention of growth teams, Vercel users among them, with its 2023 launch of geo experiments) offers a similar warehouse-native architecture with stronger product-analytics tooling. Its pricing is more usage-based.
Hightouch and GrowthBook are adjacent options; GrowthBook is open-source and the cheapest path to a SaaS-grade UI.

The decision threshold I use with clients: if you're running fewer than four concurrent geo-experiments and your data team is comfortable in R or Python, the free stack (GeoLift + Causal Impact + warehouse SQL) delivers identical statistical conclusions. Above that scale or below the data-science bench depth needed to maintain it, Eppo and Statsig pay for themselves in analyst hours saved within two quarters.

Matched market selection: where most experiments fail

The model is the easy part. The hard part is choosing which ZIPs or DMAs to treat and which to use as controls. Bad market selection wrecks the experiment no matter how rigorous the inference.

Three criteria drive a defensible match:

1. Pre-period correlation. The control set must track the test set in the pre-period. Pearson correlation above 0.85 on the primary metric across an 8-12 week pre-window is the practical threshold. Below that, the synthetic counterfactual becomes too noisy.

2. Comparable population and economic profile. Demographic skew (income, age, urban/rural mix) matters less than people think when the model is well-fit, but extreme mismatches (treating Manhattan, controlling with rural Wyoming) introduce structural bias that the model cannot fully correct.

3. No spillover. If the test ZIP and the control ZIP share a media market or a labor market, AEO interventions can spill over (a citation that appears for users in test-ZIP also serves users in adjacent control-ZIP). Spillover biases the estimate toward zero. Use geographic buffers of at least one DMA or a 50-mile radius for local-AEO work.
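
The first two criteria are easy to screen for in code. The sketch below ranks candidate controls by Pearson correlation with the test market and drops any that fall under the 0.85 threshold or sit outside a 2x scale band. The scale band is our reading of "comparable", and the market names and weekly series are hypothetical. Spillover still has to be checked against a map.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def rank_controls(test_series, candidates, min_corr=0.85, max_scale_ratio=2.0):
    """Return [(name, corr)] for candidates passing both screens, best first."""
    test_scale = mean(test_series)
    ranked = []
    for name, series in candidates.items():
        corr = pearson(test_series, series)
        ratio = mean(series) / test_scale
        in_band = 1 / max_scale_ratio <= ratio <= max_scale_ratio
        if corr > min_corr and in_band:
            ranked.append((name, corr))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical 8-week pre-period series for one test market and four candidates.
test_market = [10, 12, 11, 14, 13, 15, 16, 18]
candidates = {
    "market_a": [15, 18, 16.5, 21, 19.5, 22.5, 24, 27],   # same shape, 1.5x scale
    "market_b": [11, 12, 12, 15, 13, 16, 16, 19],          # close but noisy
    "market_c": [18, 16, 15, 13, 14, 11, 12, 10],          # moves the other way
    "market_d": [50, 60, 55, 70, 65, 75, 80, 90],          # same shape, 5x scale
}

ranked = rank_controls(test_market, candidates)
for name, corr in ranked:
    print(f"{name}: r = {corr:.3f}")
```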

Here is the matched-market table from a recent restaurant-chain AEO experiment we ran in Q4 2025. The brand operated 240 locations across 18 DMAs; we selected 6 test DMAs and 12 controls.

Test DMA	Control DMA 1	Control DMA 2	Pre-period correlation	Population (M)	Median household income ($k)
Charlotte	Raleigh-Durham	Nashville	0.91	2.7	68
Phoenix	Tucson	Las Vegas	0.89	4.9	71
Indianapolis	Cincinnati	Columbus OH	0.93	2.1	64
Portland OR	Sacramento	Seattle	0.87	2.5	79
Tampa	Orlando	Jacksonville	0.94	3.2	62
Minneapolis	Milwaukee	Kansas City	0.88	3.7	76

The high pre-period correlation in this table (mean 0.90) is what made the post-period inference credible. We selected the controls using GeoLift's GeoLiftMarketSelection function, which scores candidate controls on correlation, scale, and dynamic time warping distance.

A worked example: SaaS AEO lift in test markets

Here is a redacted version of an AEO geo experiment we ran for a US-only B2B SaaS company (Series C, ARR $32M, ICP: mid-market HR teams) between September and December 2025.

Hypothesis: A bundled AEO intervention (12 new pillar articles, schema upgrade, llms.txt publication, three original-research data drops, founder LinkedIn cadence increase) would lift inbound demo-request volume in the treated metro areas without affecting paid-channel performance.

Design: 8 test metros, 24 control metros (US-only). Pre-period September 1-30, 2025. Treatment window October 1 - November 30, 2025. Post-treatment measurement December 1-21, 2025 (to capture lagged citation effects).

Primary metric: Demo requests with a company billing address in the metro.

Secondary metrics: Branded search volume (Google Trends + Glimpse), AI-referred sessions (Profound), pipeline created.
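
A design like this is easier to pre-register when it lives in a small, checkable structure. The sketch below encodes the design above in a frozen dataclass and validates that the three windows are ordered and do not overlap. The class and field names are hypothetical, not a format from Eppo, Statsig, or our Notion template.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class GeoExperimentPlan:
    hypothesis: str
    primary_metric: str
    secondary_metrics: tuple
    n_test_markets: int
    n_control_markets: int
    pre_period: tuple      # (start, end), inclusive
    treatment: tuple       # (start, end), inclusive
    post_period: tuple     # (start, end), inclusive
    analysis_method: str

    def validate(self):
        """Raise if any window is inverted or overlaps the next one."""
        windows = [self.pre_period, self.treatment, self.post_period]
        for start, end in windows:
            if start > end:
                raise ValueError("window starts after it ends")
        for (_, prev_end), (next_start, _) in zip(windows, windows[1:]):
            if next_start <= prev_end:
                raise ValueError("windows overlap")
        return True

    @property
    def treatment_days(self):
        start, end = self.treatment
        return (end - start).days + 1

plan = GeoExperimentPlan(
    hypothesis="Bundled AEO intervention lifts inbound demo requests",
    primary_metric="demo_requests_by_billing_metro",
    secondary_metrics=("branded_search", "ai_referred_sessions", "pipeline_created"),
    n_test_markets=8,
    n_control_markets=24,
    pre_period=(date(2025, 9, 1), date(2025, 9, 30)),
    treatment=(date(2025, 10, 1), date(2025, 11, 30)),
    post_period=(date(2025, 12, 1), date(2025, 12, 21)),
    analysis_method="CausalImpact",
)
plan.validate()
print(f"treatment window: {plan.treatment_days} days")
```

Freezing the dataclass is a deliberate choice: a plan that cannot be mutated after circulation is harder to quietly revise once results come in.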

Results from CausalImpact:

Demo requests: +18.2% lift, 95% CI [+9.1%, +27.4%], posterior probability of positive effect: 0.998
AI-referred sessions: +127% lift, 95% CI [+78%, +186%], posterior probability: > 0.999
Branded search: +6.1% lift, 95% CI [+1.2%, +11.0%], posterior probability: 0.991
Pipeline created: +14.4% lift, 95% CI [+4.8%, +24.1%], posterior probability: 0.985

The pipeline result is the one the CFO cared about. The 14.4% lift, applied across the eight test metros' baseline pipeline of $3.4M/quarter, translated to $490k of incremental pipeline causally attributable to the AEO investment, against a treatment cost of $186k (content team + schema work + LinkedIn budget). The 2.6x quarterly pipeline ROI was the data point that funded the FY2026 AEO budget at 4x its FY2025 level.
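
The arithmetic behind those two figures is worth writing down, because the same calculation applies to the ends of the credible interval. The sketch below uses only the numbers quoted above. Running the interval bounds through it is our extension for illustration, not a figure from the experiment report.

```python
baseline_pipeline = 3_400_000   # quarterly pipeline across the eight test metros
treatment_cost = 186_000        # content team + schema work + LinkedIn budget

def pipeline_roi(lift):
    """Incremental pipeline and its ratio to treatment cost for a given lift."""
    incremental = baseline_pipeline * lift
    return incremental, incremental / treatment_cost

point_incremental, point_roi = pipeline_roi(0.144)
low_incremental, low_roi = pipeline_roi(0.048)    # lower end of the 95% interval
high_incremental, high_roi = pipeline_roi(0.241)  # upper end of the 95% interval

print(f"point estimate: ${point_incremental:,.0f} incremental, {point_roi:.1f}x")
print(f"interval: {low_roi:.1f}x to {high_roi:.1f}x")
```

Reporting the ROI as a range alongside the 2.6x headline keeps the finance conversation consistent with the interval-first reporting recommended later in the playbook.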

Two caveats worth flagging. First, the +127% AI-referred lift looks gaudy because the baseline was small (sessions from openai.com, perplexity.ai, anthropic.com referrer headers); even the absolute number was only 2,800 incremental sessions/month. Second, branded search lift (+6.1%) is a downstream effect — users encounter the brand in an AI answer, then search the brand on Google to verify — and it correlates strongly with eventual conversion. We've now seen this pattern in five of six AEO geo-experiments: AI-referred sessions and branded search move first, demo requests follow with a 2-3 week lag, closed revenue follows with a further 4-8 week lag. That lag structure has implications for how long you run the test, which we cover below.

The AEO geo-experiment playbook

Run a geo experiment for AEO with the same rigor you'd bring to a Phase III clinical trial. Most of the value is in pre-registering the design; running the analysis once the data is in is the easy part.

1. Pre-register the hypothesis and primary metric. Write down before treatment starts what you expect to move, by how much, and what counts as a "win." A doc dated and circulated before the test starts kills hindsight bias. State the primary metric, the secondary metrics, the test markets, the control markets, the treatment window, and the analysis method. We use a one-page Notion template; Eppo and Statsig have built-in registry workflows.

2. Run a power analysis with GeoLift. Open GeoLift's power simulator with your last 12 months of geo-level data, your candidate test markets, and a range of lift hypotheses. The output tells you minimum-detectable-effect at 80% power for a given test duration. If the answer is "you'd need a 22% lift to detect anything," your test is underpowered — either expand the test set, lengthen the window, or pick a more sensitive metric.

3. Select matched controls using pre-period correlation. Use GeoLift's GeoLiftMarketSelection function or write a SQL query that ranks candidate controls by Pearson correlation with each test market on the primary metric across the pre-window. Pick controls with correlation > 0.85, scale within 2x of test scale, and no media-market overlap.

4. Apply the bundled AEO intervention in test markets only. For a national-brand B2B test, this typically means hyper-local schema, ZIP-specific landing pages, geo-targeted founder LinkedIn content, local PR mentions, and Google Business Profile work in test markets only. For multi-location brands, it's location-page schema upgrades and llms.txt publication only on test-DMA URLs. Discipline matters — any leakage into control markets biases the estimate downward.

5. Run the treatment window for 6-12 weeks. Less than 4 weeks of treatment is almost always underpowered because of citation lag. We aim for an 8-week treatment + 4-week stable post-window.

6. Run CausalImpact or GeoLift inference. Pull daily metric data by geo, fit the model, generate the impact report. Both packages produce presentation-ready charts and uncertainty intervals automatically (posterior credible intervals in CausalImpact, confidence intervals and a p-value in GeoLift).

7. Cross-validate with a placebo test. Before reporting, run the same analysis treating a random control market as if it were the test market. If you find a "significant effect" on a market that received no treatment, your model is over-fit or your matched controls are leaky. Iterate market selection until placebos consistently show null.

8. Report with credible intervals, not just point estimates. "We saw +18% lift, 95% CI [+9%, +27%]" beats "we saw +18% lift" every time in a board deck. CFOs respect the uncertainty quantification more than a confident single number.
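
The placebo check in step 7 is the step teams most often skip, so here is a minimal sketch of its shape. It uses a simple ratio estimator in place of CausalImpact or GeoLift, treats each control market in turn as if it had been treated, and compares those placebo "lifts" with the real one. The market names and numbers are hypothetical.

```python
def ratio_lift(test, controls):
    """Change in the test/control ratio from pre-period to post-period."""
    ctrl_pre = sum(sum(c["pre"]) for c in controls)
    ctrl_post = sum(sum(c["post"]) for c in controls)
    pre_ratio = sum(test["pre"]) / ctrl_pre
    post_ratio = sum(test["post"]) / ctrl_post
    return post_ratio / pre_ratio - 1

def placebo_lifts(controls):
    """Pretend each control was treated and estimate its lift against the rest."""
    results = {}
    for name, market in controls.items():
        others = [m for other, m in controls.items() if other != name]
        results[name] = ratio_lift(market, others)
    return results

# Hypothetical weekly primary-metric data; only test_market received treatment.
test_market = {"pre": [80, 83, 78, 88], "post": [99, 103, 97]}
controls = {
    "control_1": {"pre": [100, 104, 98, 110], "post": [108, 112, 105]},
    "control_2": {"pre": [200, 208, 196, 220], "post": [216, 224, 210]},
    "control_3": {"pre": [50, 52, 49, 55], "post": [54.5, 56.6, 53.0]},
}

real_lift = ratio_lift(test_market, list(controls.values()))
placebos = placebo_lifts(controls)
print(f"real lift: {real_lift:+.1%}")
for name, lift in placebos.items():
    print(f"placebo {name}: {lift:+.1%}")
```

A real effect should stand clear of the placebo spread. If any placebo lift is comparable in size to the real one, revisit market selection before reporting.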

Local AEO: where ZIP codes shine

For multi-location brands (restaurants, dental practices, fitness studios, healthcare clinics, retail stores), ZIP-level rather than DMA-level experiments are usually the right unit. Three reasons:

First, LLM grounding for local queries weights ZIP-level signals heavily. Ask ChatGPT search "best dentist near 02139" and it will return different results than "best dentist near 02140" three miles away, because the local-citation corpus is finer-grained than DMA.

Second, ZIP codes are typically how multi-location-brand revenue is sliced in the CRM — billing ZIP, shipping ZIP, store ZIP. Matching the experiment unit to the data unit removes ambiguity.

Third, you get more sample. A national footprint of 240 locations across 18 DMAs gives you up to 240 ZIP-level experimental units (fewer if locations share a ZIP) but only 18 DMA-level units. Statistical power scales with sample size; ZIP-level experiments can detect smaller lifts.

Our Local AEO deep-dive details the local-citation interventions worth bundling into the treatment package. The geo experiment is the measurement wrapper; the interventions themselves are the local AEO playbook.

One caution: at the ZIP level, individual-location idiosyncrasy (a strong store manager, a local PR cycle, a competitor opening across the street) creates noise that DMA aggregation smooths out. Run sensitivity analysis by re-running the inference with the noisiest ZIPs excluded and check whether your conclusion holds.
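
One way to run that sensitivity check is sketched below, assuming "noisiest" means the highest pre-period coefficient of variation. That definition is our choice for illustration, as are the ratio estimator, the ZIP labels, and the numbers.

```python
from statistics import mean, pstdev

def pooled_lift(zips, control_pre, control_post):
    """Change in the pooled test/control ratio from pre-period to post-period."""
    t_pre = sum(sum(z["pre"]) for z in zips)
    t_post = sum(sum(z["post"]) for z in zips)
    return (t_post / sum(control_post)) / (t_pre / sum(control_pre)) - 1

def noise(zip_data):
    """Coefficient of variation of the pre-period series."""
    return pstdev(zip_data["pre"]) / mean(zip_data["pre"])

def sensitivity(zips, control_pre, control_post, drop=1):
    """Lift with all test ZIPs, then with the `drop` noisiest ZIPs excluded."""
    full = pooled_lift(zips, control_pre, control_post)
    kept = sorted(zips, key=noise)[: len(zips) - drop]
    trimmed = pooled_lift(kept, control_pre, control_post)
    return full, trimmed

# Hypothetical test ZIPs; zip_c is erratic in the pre-period and spikes after.
test_zips = [
    {"zip": "zip_a", "pre": [50, 50, 50, 50], "post": [55, 55, 55, 55]},
    {"zip": "zip_b", "pre": [40, 42, 38, 40], "post": [44, 44, 44, 44]},
    {"zip": "zip_c", "pre": [10, 30, 5, 35], "post": [40, 40, 40, 40]},
]
control_pre = [200, 200, 200, 200]
control_post = [200, 200, 200, 200]

full, trimmed = sensitivity(test_zips, control_pre, control_post, drop=1)
print(f"all ZIPs: {full:+.1%}, noisiest excluded: {trimmed:+.1%}")
```

In this toy data the sign survives but the size of the lift drops sharply once the erratic ZIP is removed, which is exactly the kind of result that should temper the headline number.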

Combining geo experiments with incrementality holdouts

Geo experiments are one tool in the incrementality toolbox. They complement, rather than replace, two related methods:

User-level holdouts work when the channel does respect user identity (paid search, retargeting, email). For those channels, our AEO incrementality holdout methodology applies directly. Pair a user-level holdout for paid channels with a geo-level holdout for AEO and you have a complete incrementality picture across your media mix.

Marketing mix models (MMM) estimate channel-level contribution across the whole business without any holdout. MMMs are powerful but require 2-3 years of weekly data to fit well, and they tend to under-attribute new channels (AEO is the obvious case). The right pattern is to use geo experiments to calibrate the MMM's AEO estimate: feed the causal estimate from your geo experiment into the model so the channel coefficient is anchored to a defensible causal number. Robyn, Meta's open-source MMM package, supports calibration against experimental lift results out of the box, and in Bayesian MMMs such as LightweightMMM and Recast the same estimate can inform the prior on the channel.
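
As a rough illustration of what anchoring means numerically, the sketch below turns a reported 95% interval into a normal distribution and combines it with an uncalibrated estimate by precision weighting. The normal approximation, the helper names, and the MMM figures are all assumptions for illustration. How a lift maps onto a coefficient depends on each MMM's parameterization, and none of this reproduces a specific package's calibration routine.

```python
from statistics import NormalDist

def interval_to_normal(mean, low, high, level=0.95):
    """Approximate a reported interval as Normal(mean, sd)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean, (high - low) / (2 * z)

def combine_normal(prior_mean, prior_sd, est_mean, est_sd):
    """Precision-weighted combination of two normal estimates."""
    w_prior, w_est = 1 / prior_sd ** 2, 1 / est_sd ** 2
    post_mean = (w_prior * prior_mean + w_est * est_mean) / (w_prior + w_est)
    post_sd = (w_prior + w_est) ** -0.5
    return post_mean, post_sd

# Geo-experiment pipeline lift from the worked example: +14.4% [+4.8%, +24.1%].
prior_mean, prior_sd = interval_to_normal(0.144, 0.048, 0.241)

# Hypothetical uncalibrated MMM estimate for the same channel (made-up values).
mmm_mean, mmm_sd = 0.05, 0.08

post_mean, post_sd = combine_normal(prior_mean, prior_sd, mmm_mean, mmm_sd)
print(f"geo prior: {prior_mean:.1%} (sd {prior_sd:.1%})")
print(f"calibrated estimate: {post_mean:.1%} (sd {post_sd:.1%})")
```

The tighter of the two inputs dominates, which is why a well-powered geo experiment pulls a vague MMM estimate toward the causal number.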

The methodology pyramid in 2026 looks like this: geo experiments at the base (causal ground truth, expensive but rigorous), MTA and MMM in the middle (always-on attribution, calibrated by experiments), self-reported attribution at the top (survey "how did you hear about us" signals, cheapest but noisiest). AEO needs all three layers because no single layer answers the full revenue-attribution question.

Common failure modes and how to avoid them

Five mistakes account for roughly 80% of failed AEO geo experiments we've audited:

1. Treatment too short. Two weeks is not enough. Citation lag in LLMs averages 7-21 days from content publication; you need at least 4 weeks of treatment to see any AEO effect, and 6-8 is better. Plan for it upfront.

2. Control leakage. AEO content published "in test markets only" usually leaks because content is global — a blog post indexed by Google is indexed for everyone. The discipline of "test markets only" means targeting (local schema, local pages, local PR) is test-only, not all content publication. Be explicit about which interventions are global (and therefore not testable via geo) versus local-targeted (and therefore testable).

3. Underpowered metric choice. Closed-won revenue at the ZIP level for a B2B SaaS company is often too sparse to test in a 12-week window. Move primary metric upstream — demo requests, qualified pipeline created, even branded search — and use closed revenue as a secondary or confirmatory metric.

4. Ignoring placebo tests. If you don't run a placebo, you don't know whether your model is detecting real effects or artifacts of overfitting. Always placebo-test before reporting.

5. Reporting point estimates without intervals. A bare "+18% lift" gets challenged. A "+18% lift, 95% CI [+9%, +27%], posterior probability of positive effect 99.8%" is bulletproof. Always report the uncertainty.

Tooling stack: a 2026 buying guide

Tool	Best for	Pricing	Notes
Google CausalImpact (R)	Solo analyst, post-hoc inference	Free	The reference implementation. Brodersen et al. 2015.
causalimpact (Python)	Python-native teams	Free	Feature-equivalent port; PyMC for advanced specs.
Meta GeoLift (R)	Power analysis + market selection	Free	Best free option for the design phase.
Eppo	Enterprise multi-team workflow	~$1.5k/mo to mid-six-figures/yr	Snowflake/BigQuery-native; geo-experiment recipes.
Statsig	Product + marketing combined	Usage-based	Stronger product analytics; growing geo footprint.
GrowthBook	Open-source SaaS-grade UI	Free / $20/seat	Lighter weight than Eppo; geo support newer.
Robyn (Meta)	MMM calibrated by geo experiments	Free	Feed geo lift in as a calibration input for the AEO channel.
LightweightMMM (Google)	Bayesian MMM	Free	Same pattern; smaller user community than Robyn.

The right starting point for almost every team in 2026 is GeoLift + CausalImpact + Robyn, all free, all maintained by Google or Meta research teams. The SaaS layer is justified once you've outgrown that stack — typically at 4+ concurrent experiments or 10+ analyst hours per week of experimentation work.

What's coming in 2027

Three trends will reshape AEO geo experiments in the next 18 months:

Per-LLM grounding-aware experiments. Different LLMs ground location differently (ChatGPT search heavily weights IP, Perplexity weights account-declared location, Gemini blends both). The next generation of geo experiments will test each grounding source independently. Statsig has hinted at this in their roadmap; Eppo's product team has confirmed they're working on per-LLM segmentation.

Pre-trained synthetic controls. Research on "matrix completion" methods from Susan Athey's group at Stanford could, by mid-2027, allow synthetic-control estimation without hand-matched pre-period correlations, so that AEO geo experiments can run on geographies with sparse historical data. Causal Impact and GeoLift both have research roadmaps pulling in that direction.

LLM-vendor-published geo dashboards. OpenAI, Anthropic, and Perplexity have all hinted at publishing geo-level citation share-of-voice for brands. When that ships (most likely Q3-Q4 2026 for at least one major vendor), geo experiments will have a richer treatment-effect target. Expect citation-share-by-geo to become the standard top-of-funnel metric.

Takeaway: Geo experiments using Google Causal Impact or Meta GeoLift are the only defensible way to prove AEO investment moves revenue, because LLM citation behavior cannot be A/B tested at the user level. The open-source stack (GeoLift for power analysis and market selection, Causal Impact for inference) delivers the same statistical rigor as Eppo or Statsig at zero license cost, and is sufficient for any team running fewer than four concurrent experiments. The hard part is not the math; it is the discipline of pre-registering hypotheses, selecting matched markets with pre-period correlation above 0.85, running treatment for at least six weeks to clear citation lag, and reporting credible intervals rather than point estimates. Operators who run one well-designed geo experiment per fiscal half-year will keep their AEO programs funded by CFOs who would otherwise default to "I can't see the ROI." That is the budget unlock.

Frequently Asked Questions

What is a geo experiment for AEO and why use ZIP codes?

A geo experiment for AEO splits a region into matched test and control geographies, applies the AEO intervention (citation work, local schema, llms.txt, content push) to test markets only, then compares outcomes against the synthetic counterfactual built from control markets. ZIP codes are the right unit for local AEO because they roughly map to LLM grounding behavior in tools like ChatGPT search and Perplexity, they are small enough to give a large sample of geographies, and they tie cleanly to most CRM and ad-platform location fields. For national-brand AEO, DMAs (210 in the US) are often a better unit because of higher per-unit volume and lower noise. The output is a defensible point estimate of incremental revenue or sessions, with credible intervals, that survives CFO scrutiny.

How does Google Causal Impact differ from a regular A/B test?

Google's Causal Impact R package, released in 2014 by Kay Brodersen and colleagues at Google Research, fits a Bayesian structural time-series model to pre-intervention control data, then projects what the test market would have done absent the intervention. The difference between observed and projected is the causal effect, with full posterior credible intervals. Unlike a standard A/B test, Causal Impact does not require user-level randomization, which is impossible for AEO because LLM citation behavior is not user-randomizable. It works for organic channels, brand marketing, and any intervention you cannot randomize at the click level. The tradeoff is that the inference is only as good as the control series, which is why matched-market selection matters more than the model choice itself.

Can I run a geo experiment on a tight budget without Eppo or Statsig?

Yes. The open-source stack is sufficient for most operators. Install the CausalImpact R package (or its Python port), pull daily revenue and sessions by ZIP or DMA from your warehouse, choose 5 to 10 matched test markets and 20 to 40 control markets using pre-period correlation, and treat the test markets with the AEO intervention for at least four weeks. Meta's GeoLift R package adds power analysis and market selection automation and is also free. Paid platforms like Eppo and Statsig add multi-team workflow, automated power calculations, and PR-grade reporting; they justify their cost above roughly $10M ARR or for teams running more than four concurrent experiments. Below that scale, the open-source path delivers identical statistical rigor at zero license cost.

How long does a ZIP-code AEO geo experiment need to run?

Plan for a six-to-twelve-week test window with a four-to-eight-week stable pre-period for model fitting. Four weeks is the practical minimum for treatment because LLM citation indexes lag content publication by 7 to 21 days for most major engines, and you need at least two stable post-citation weeks for the conversion data to settle. Underpowered tests that run two weeks are the most common mistake we see; they almost always fail to reject the null even when the intervention worked. Run a power analysis upfront using GeoLift's power simulator. Causal Impact has no built-in power calculator, so the equivalent there is to inject a synthetic lift into historical data and check whether the model recovers it. The minimum detectable effect at the geo level is typically 8 to 15 percent lift, which is meaningful for local-AEO work but too coarse to detect 2 to 3 percent changes.

What metrics should I measure in an AEO geo experiment?

Three layers. Top-of-funnel: branded search volume per geo (Google Trends or paid Glimpse data), direct traffic, and citation share-of-voice tracked by Profound, Otterly, or Peec. Mid-funnel: organic sessions, AI-referred sessions (from utm and referrer parsing for OpenAI, Anthropic, Perplexity), and lead-form submits. Bottom-funnel: pipeline created, opportunities, and closed-won revenue tied to geo via CRM billing-state or shipping-ZIP field. The Causal Impact model is run separately for each metric, and you typically expect lifts to appear further down the funnel with progressively longer lag. For local-AEO work, store visits via Google Business Profile insights and Apple Business Connect actions are also worth including, since LLM-driven discovery often resolves in offline foot traffic that web analytics cannot capture.
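
The referrer parsing mentioned for AI-referred sessions can be as small as the sketch below. The domain list mirrors the vendors named in this article and is an assumption to verify against your own logs, since the hostnames each assistant actually sends may differ or change. `classify_referrer` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Assumed referrer domains, taken from the vendors named in this article.
AI_REFERRER_DOMAINS = {
    "openai.com": "OpenAI",
    "perplexity.ai": "Perplexity",
    "anthropic.com": "Anthropic",
}

def classify_referrer(referrer):
    """Return the AI vendor for a referrer URL, or None if it is not one."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, vendor in AI_REFERRER_DOMAINS.items():
        # Match the domain itself or any subdomain, but not lookalike hosts.
        if host == domain or host.endswith("." + domain):
            return vendor
    return None

sessions = [
    "https://chat.openai.com/",
    "https://www.perplexity.ai/search?q=best+crm",
    "https://www.google.com/search?q=best+crm",
    "https://notopenai.com/",
    "",
]
for referrer in sessions:
    print(repr(referrer), "->", classify_referrer(referrer))
```

Matching on the full hostname suffix rather than a substring avoids counting lookalike domains as AI traffic, which matters when the baseline is as small as it usually is for this metric.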