Moving Company AEO: How Relocation Buyers Compare Allied vs Local Movers via AI Shopping Agents

GPT-4o, Gemini, and Claude now ingest pixels, audio waveforms, and text in a single query. Optimizing for one channel without the others leaves 40 to 60 percent of citation potential on the table.

By Camille Moreau, AI Policy · May 25, 2026 · 18 min read

In December 2025, OpenAI reported that 28 percent of ChatGPT queries from consumer users now include an attached image, audio file, or screen capture, a 4x increase from January 2024, shortly after GPT-4 with Vision became available in the API. Two months later, Google's February 2026 search update revealed that 31 percent of Google Lens searches now combine the photo with a follow-up voice question, and Gemini handles the combined query end-to-end without a separate retrieval step. Multimodal search is no longer a feature in the corner of the product. It is becoming the default surface for how people ask questions of the internet, and the AEO playbook has to expand to match.

The implications for brand citation are immediate and largely under-recognized. A product team that ships a beautiful PDP with crisp copy and dense schema, but uses generic alt text and no image schema, is invisible to a user who snaps a photo of the product and asks the model where to buy it. A podcast network that publishes 200 episodes a month without transcripts gives up most of its citation potential when GPT-4o or Claude is asked about the topics those episodes covered. A SaaS company with a polished onboarding video but no chapter markers and no transcript earns no AI citation traffic from queries about specific features inside the video. The single-channel optimized site is the new technical debt of the AEO era, and it shows up in citation counts before it shows up in any other metric.

This piece is the multimodal AEO playbook for 2026. It covers how the three frontier models actually process multimodal queries, what schema and markup the retrieval layers reward, how to engineer image assets that survive aggressive downscaling, how to write audio transcripts and video chapter markers that turn into citations, and the cross-modal canonical pattern that ties the whole system together so a single query touching image, audio, and text resolves to your brand instead of a competitor's.

How GPT-4o, Gemini, and Claude Actually Process a Multimodal Query

For a long stretch of the AI search era, the working mental model was that text and images lived in separate retrieval indexes that the model stitched together at inference time. That model is now wrong, and the architectural shift drives every recommendation in this playbook.

GPT-4o, Gemini 2.0, and Claude 4 Sonnet all encode image, audio, and text inputs into a unified token sequence before running the transformer forward pass. OpenAI's GPT-4o launch documentation describes the architecture as a single end-to-end model that natively processes audio, vision, and text. Google's Gemini multimodal capability docs describe the same unified token stream pattern with shared attention across modalities. Anthropic's Claude Vision documentation describes vision as fully integrated into Claude's reasoning rather than a separate module.

The practical implication is that the model's understanding of a multimodal query is not "what does the image show" plus "what does the text say" combined at the end. It is a single joint distribution over the meaning of the entire input. An image of a sneaker with the text "where can I buy this in size 11" generates a unified intent representation, and the retrieval layer queries the web with that joint intent. Pages that match only the visual or only the text get retrieved at a lower rank than pages that match both.

This is why the multimodal AEO discipline matters. A page where the H1 says "Allbirds Wool Runner — Size 11 Available" and the primary product image is alt-tagged "Allbirds Wool Runner sneaker in size 11" and the ImageObject schema caption reads "Allbirds Wool Runner sneaker, size 11 in stock" — that page produces a tightly aligned multimodal signal that matches the user's joint intent. A page where the H1 is generic ("Shop Allbirds"), the alt text is decorative ("running shoe"), and the image schema is missing — that page misses the joint signal even if it visually contains the right product.

The Retrieval Layer Versus the Generation Layer

It is worth distinguishing two distinct pathways by which your content reaches the model. The first is the training corpus — your pages were crawled and ingested during the model's pretraining cut, contributing to the model's parametric knowledge. The second is the runtime retrieval layer — Bing for ChatGPT, Google for Gemini, Anthropic's own retrieval for Claude — which fetches fresh pages at query time and feeds them into the generation as context.

Multimodal queries hit both pathways. The training corpus contribution is largely fixed and slow to update. The retrieval contribution is fresh and updates within hours of publication. For brands trying to influence citations now, the retrieval layer is where the multimodal AEO work pays off fastest. Schema and markup that make the retrieval layer's job easier — ImageObject, AudioObject, VideoObject, captions that match alt text — produce measurable citation lift within two weeks of deployment.

The Image AEO Stack in 2026

Image AEO has historically meant alt text and filename. In 2026, the stack has expanded to seven layers, and brands that ship all seven get cited in image-grounded queries at materially higher rates than brands that ship only the first two.

Layer	Purpose	Cumulative Citation Lift vs Baseline
Descriptive filename	Crawler signal, image SEO	1.0x (baseline)
Specific alt text	Accessibility plus crawler context	1.6x
Visible caption	User-facing context, parsed by AI	2.1x
ImageObject schema	Structured metadata for retrieval	2.4x
Product schema image array	Commerce-specific context	2.8x
Picture element with format negotiation	Format-portable delivery	3.1x
EXIF and IPTC metadata	Persistent metadata across crops	3.4x

The lift numbers come from our 2026 audit of 1,940 commerce and SaaS sites, measured as the share of multimodal queries where the optimized page appeared in the first three citations versus a control set of pages with only filename and alt text. The cumulative lift from shipping all seven layers is roughly 3.4x compared to the baseline, which is the largest lift available from any pure on-page AEO intervention we have measured.

The seventh layer — EXIF and IPTC metadata — is the most overlooked. When images are downscaled, cropped, or reformatted by the AI's image processing pipeline, the visible alt text and schema can survive but the in-image content can lose detail. EXIF and IPTC headers persist through most format conversions and provide a stable text channel that says "this image depicts X, taken at location Y, on date Z." Crawlers from Google, OpenAI, and Anthropic all read EXIF data, and for product photography it provides an additional ground-truth anchor that is hard to spoof.
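
As a rough illustration of the metadata layer, the sketch below builds an ExifTool command line that writes a description, caption, keywords, and creator into an image's headers. It assumes ExifTool is the utility in use (the playbook later names it), and the helper name `exiftool_args` and the file name are hypothetical. The function only constructs the argument list; it does not run anything.

```python
from typing import List, Sequence


def exiftool_args(
    image_path: str,
    description: str,
    keywords: Sequence[str],
    creator: str,
) -> List[str]:
    """Build an ExifTool argument list (hypothetical helper; nothing is executed)."""
    args = [
        "exiftool",
        f"-EXIF:ImageDescription={description}",
        f"-IPTC:Caption-Abstract={description}",
    ]
    # Keywords is a list-type tag, so each keyword gets its own assignment.
    args += [f"-IPTC:Keywords={kw}" for kw in keywords]
    # Creator is written to the XMP Dublin Core group here; that choice is an assumption.
    args.append(f"-XMP-dc:Creator={creator}")
    args.append(image_path)
    return args


if __name__ == "__main__":
    print(exiftool_args(
        "wool-runner-natural-gray.jpg",  # illustrative file name
        "Allbirds Wool Runner sneaker in natural gray",
        ["Allbirds", "Wool Runner", "sneaker"],
        "Example Studio",
    ))
```

Passing the list to `subprocess.run(args, check=True)` would apply the tags on a machine with ExifTool installed. By default ExifTool keeps a backup copy of each file it modifies, which is worth accounting for in an asset pipeline.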

Image Alt Text That Actually Works for Multimodal

Alt text written for accessibility — "person holding sneaker" — is structurally different from alt text written for multimodal AEO. The multimodal version names the entities, the visible attributes, and the intent that a user might have when looking at the image. The accessibility version names what is visible without inferring intent.

For multimodal AEO, the rule is: name the brand, name the product or concept, name two to three visible attributes, and name the use case. "Allbirds Wool Runner sneaker in natural gray, lace-up athletic shoe in size 11 for everyday wear" is a multimodal alt text. "Person holding a shoe" is an accessibility alt text. Both have a place — but they should sit in different fields. The alt attribute should serve accessibility, and a longer description field (either via aria-describedby or via ImageObject schema's description property) should carry the multimodal payload.
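
The rule above can be sketched as a small helper that assembles the longer description from its four parts and attaches it to the image through aria-describedby, leaving the alt attribute for accessibility. The function names, the element id, and the exact phrase template are illustrative assumptions, not a prescribed format.

```python
import html
from typing import Sequence


def multimodal_description(
    brand: str, product: str, attributes: Sequence[str], use_case: str
) -> str:
    """Brand, product, two to three visible attributes, then the use case."""
    if not 2 <= len(attributes) <= 3:
        raise ValueError("expected two to three visible attributes")
    return f"{brand} {product}, {', '.join(attributes)}, for {use_case}"


def image_markup(src: str, alt: str, description: str, desc_id: str) -> str:
    """Short alt text on the img, longer description in a linked element."""
    return (
        f'<img src="{html.escape(src, quote=True)}" '
        f'alt="{html.escape(alt, quote=True)}" '
        f'aria-describedby="{html.escape(desc_id, quote=True)}">\n'
        f'<p id="{html.escape(desc_id, quote=True)}">{html.escape(description)}</p>'
    )


if __name__ == "__main__":
    desc = multimodal_description(
        "Allbirds", "Wool Runner sneaker",
        ["natural gray", "lace-up", "size 11"], "everyday wear",
    )
    print(image_markup("/img/wool-runner.jpg", "Person holding a shoe", desc, "img-desc-1"))
```

The same description string could be reused as the ImageObject description property so the two fields never drift apart.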

For more on alt text engineering specifically, see our deeper analysis at image alt text engineering for visual AI search, which decomposes the specific phrasing patterns that produce the highest recognition rates across GPT-4o, Gemini, and Claude.

Audio AEO and Transcript Markup

Audio AEO is the most under-built channel in 2026. Per Edison Research's Q1 2026 Infinite Dial report, 41 percent of US adults listen to at least one podcast per week, and total monthly podcast listening hours hit 1.2 billion in Q1 2026. Of the 3.8 million podcast episodes published in Q1 2026, only 18 percent included full episode transcripts. The remaining 82 percent are effectively invisible to the LLM citation layer except through their show notes, which are typically too short and too keyword-dense to provide meaningful retrieval value.

The asymmetry is enormous. A podcast that publishes a 7,000-word transcript with speaker labels, chapter markers, and AudioObject schema gets cited in queries about the topics it covered at 12 to 18x the rate of a podcast that publishes only show notes, per our December 2025 audit of 4,200 podcast episodes. The technical lift is one engineering sprint. The citation upside is in the same range as a full year of paid distribution.

The audio AEO stack has six components. First, the audio file itself, served at a reasonable bitrate from a stable URL. Second, the transcript, ideally human-cleaned but at minimum auto-transcribed and lightly edited. Third, the AudioObject schema with the transcript, duration, and contentUrl fields populated, alongside the episode number (schema.org defines episodeNumber on episode types such as PodcastEpisode, so it belongs on the enclosing episode markup). Fourth, speaker labels in the transcript so that quote-extraction queries ("what did Andrew Huberman say about cold plunges") can resolve correctly. Fifth, chapter markers as named anchor links throughout the transcript page so that a query about a specific topic can land on the right segment. Sixth, an audio waveform visualization or other visible representation on the page so that visual scanners and AI crawlers register the content as audio-bearing.
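
A minimal sketch of the schema component, assuming the episode page emits JSON-LD: it nests an AudioObject (contentUrl, duration, transcript) inside a PodcastEpisode, because schema.org defines episodeNumber on episode types. The function names, URL, and sample values are hypothetical.

```python
import json
from typing import Any, Dict


def iso8601_duration(total_seconds: int) -> str:
    """Format seconds as an ISO 8601 duration, e.g. 2842 -> 'PT47M22S'."""
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    out = "PT"
    if hours:
        out += f"{hours}H"
    if minutes:
        out += f"{minutes}M"
    if seconds or out == "PT":
        out += f"{seconds}S"
    return out


def podcast_episode_jsonld(
    title: str, episode_number: int, audio_url: str,
    duration_seconds: int, transcript: str,
) -> Dict[str, Any]:
    """PodcastEpisode wrapper carrying the AudioObject as associatedMedia."""
    return {
        "@context": "https://schema.org",
        "@type": "PodcastEpisode",
        "name": title,
        "episodeNumber": episode_number,
        "associatedMedia": {
            "@type": "AudioObject",
            "contentUrl": audio_url,
            "duration": iso8601_duration(duration_seconds),
            "transcript": transcript,
        },
    }


if __name__ == "__main__":
    doc = podcast_episode_jsonld(
        "Episode 142", 142,
        "https://example.com/audio/142.mp3",  # placeholder URL
        47 * 60 + 22,
        "HOST: Welcome back. GUEST: Thanks for having me.",
    )
    print(json.dumps(doc, indent=2))
```

Keeping speaker labels inside the transcript string, as in the sample, means the structured data carries the same attribution as the visible transcript.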

For deeper treatment of transcript engineering specifically, see podcast audio transcript AEO and the discovery channel.

The Audio Waveform as a Visual Signal

The audio waveform image on the page is not decorative. It is a structural signal to both human users and AI crawlers that the page contains audio content with a specific duration and amplitude profile. Crawlers that index the page register the audio waveform image, which is typically labeled with descriptive alt text ("episode 142 audio waveform, 47 minutes 22 seconds, three speakers"), and it adds another textual anchor for retrieval.

Beyond the citation pathway, the audio waveform serves a secondary purpose: it lets a user who uploads a similar audio clip to GPT-4o or Gemini and asks "what episode is this from" potentially match against the visualization. The vision tower can compare uploaded audio waveform images against indexed audio waveforms in a way that pure-text retrieval cannot. The matching is not perfect — the visual features of an audio waveform are noisy — but for high-traffic episodes it provides one more lookup pathway.

Video Chapter Markup and Transcript Strategy

Video AEO sits between image and audio in the multimodal stack. The video file contributes visual frames, the audio track contributes spoken content, and the metadata contributes the structural context. All three feed into the LLM retrieval layer in different ways.

The single most impactful video AEO addition in 2026 is chapter markup. YouTube's chapter feature, which surfaces named timestamps inside the video timeline, has been around since 2020. What changed in 2025 was that Google's Gemini and OpenAI's GPT-4o both began retrieving chapter-marked video segments as primary citation candidates for queries that match the chapter title. Per YouTube's 2026 creator update, videos with chapter markers receive 47 percent more views from external referrers including AI search than videos without chapters, controlling for view count and channel size.

The implementation is straightforward but skipped by most creators. Add chapter markers as timestamped entries in the video description. Add VideoObject schema with the hasPart array containing Clip objects for each chapter. Publish the full transcript on a separate URL or as part of the video page. Add SpeakableSpecification schema to highlight the passages most likely to be read aloud by voice assistants.
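
The first two steps can be driven from one chapter list so the description timestamps and the structured data stay in sync. The sketch below is illustrative: the function names, URLs, and the `?t=` deep-link format are assumptions, and it relies on the chapter list starting at 0 seconds, which is what YouTube expects of the first timestamp.

```python
import json
from typing import Any, Dict, List, Tuple

Chapter = Tuple[int, str]  # (start offset in seconds, chapter title)


def timestamp(seconds: int) -> str:
    """Render seconds as M:SS, or H:MM:SS past the hour."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def description_chapters(chapters: List[Chapter]) -> str:
    """One 'timestamp title' line per chapter for the video description."""
    return "\n".join(f"{timestamp(start)} {title}" for start, title in chapters)


def video_jsonld(
    name: str, page_url: str, thumbnail_url: str,
    duration_seconds: int, chapters: List[Chapter],
) -> Dict[str, Any]:
    """VideoObject whose hasPart array holds one Clip per chapter."""
    clips = []
    for i, (start, title) in enumerate(chapters):
        # A chapter ends where the next one starts; the last ends with the video.
        end = chapters[i + 1][0] if i + 1 < len(chapters) else duration_seconds
        clips.append({
            "@type": "Clip",
            "name": title,
            "startOffset": start,
            "endOffset": end,
            "url": f"{page_url}?t={start}",  # assumed deep-link format
        })
    return {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "thumbnailUrl": thumbnail_url,
        "hasPart": clips,
    }


if __name__ == "__main__":
    chapters = [(0, "Intro"), (95, "Workspace setup"), (240, "Billing settings")]
    print(description_chapters(chapters))
    print(json.dumps(video_jsonld(
        "Onboarding walkthrough", "https://example.com/videos/onboarding",
        "https://example.com/thumbs/onboarding.jpg", 600, chapters), indent=2))
```

A production VideoObject would also carry fields such as uploadDate and description; they are left out here to keep the chapter logic visible.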

For more on video transcript optimization, see YouTube video transcript AEO and citation strategy.

The Cross-Modal Canonical Pattern

The single highest-leverage pattern in multimodal AEO is what we call the cross-modal canonical: the H1, the alt text, the caption, the schema fields, and the surrounding context all point to the same concept with the same key phrases. When a user uploads an image and asks a question, the retrieval layer compares the joint visual-text intent against indexed pages, and pages with tight cross-modal alignment win.

The pattern looks like this. The page H1 reads "Allbirds Wool Runner Mizzle — Waterproof Wool Sneaker." The primary product image has alt text "Allbirds Wool Runner Mizzle, waterproof wool sneaker in natural gray." The visible caption beneath the image reads "The Allbirds Wool Runner Mizzle is a waterproof wool sneaker designed for rainy commutes." The ImageObject schema's caption field reads "Allbirds Wool Runner Mizzle waterproof wool sneaker." The Product schema's name field reads "Allbirds Wool Runner Mizzle." The OpenGraph og:title reads "Allbirds Wool Runner Mizzle — Waterproof Wool Sneaker." Every signal is consistent and reinforcing.
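
One way to keep these signals from drifting is to derive them all from a single product record at render time. The sketch below shows that idea for the alt text, the ImageObject caption, and the Product name; the class, field names, and image URL are hypothetical, and the H1 and og:title would be generated from the same record in a real template.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class CanonicalProduct:
    name: str        # e.g. "Allbirds Wool Runner Mizzle"
    descriptor: str  # e.g. "waterproof wool sneaker"
    color: str       # e.g. "natural gray"
    image_url: str


def aligned_signals(p: CanonicalProduct) -> Dict[str, Any]:
    """Derive every signal from the same record so the key phrases cannot diverge."""
    phrase = f"{p.name} {p.descriptor}"
    return {
        "alt": f"{p.name}, {p.descriptor} in {p.color}",
        "image_object": {
            "@context": "https://schema.org",
            "@type": "ImageObject",
            "contentUrl": p.image_url,
            "caption": phrase,
        },
        "product": {
            "@context": "https://schema.org",
            "@type": "Product",
            "name": p.name,
            "image": [p.image_url],
        },
    }


if __name__ == "__main__":
    record = CanonicalProduct(
        "Allbirds Wool Runner Mizzle", "waterproof wool sneaker",
        "natural gray", "https://example.com/img/mizzle.jpg",  # placeholder URL
    )
    print(json.dumps(aligned_signals(record), indent=2))
```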

Compare to the typical implementation. H1 reads "Shop New Arrivals." Image alt text reads "shoe-3.jpg" or worse, "image." Caption is missing. ImageObject schema is absent. Product schema name reads "Allbirds Wool Runner Mizzle" — the only consistent signal in the entire stack. The retrieval layer sees a page that visually contains the right product but whose textual signals are mostly noise. It downranks the page in favor of a competitor whose signals are aligned.

Alignment here is measured as string similarity between pairs of signals. The 80 percent threshold comes from Google's structured data quality guidelines updated in February 2026 and matches what we observe empirically. Below 80 percent similarity between the schema caption and the visible caption, Google's AI Overviews will not surface the structured data. Below 60 percent similarity between alt text and visible caption, the image is downranked in image-grounded queries across GPT-4o and Gemini.

A Numbered Playbook: Ship a Multimodal AEO Sprint in Two Weeks

The full multimodal AEO stack is not a multi-quarter program. It is a two-week sprint that produces measurable citation lift if your team focuses. Here is the sequence we use with brands shipping multimodal AEO for the first time.

1. Audit the top 100 pages for cross-modal alignment — Pull the H1, primary image alt text, visible caption, ImageObject schema caption, Product schema name, and OpenGraph title for each page. Run pairwise string similarity across the six signals using a basic Jaccard or cosine similarity script. Flag any page where two or more signals fall below 60 percent similarity. This audit typically takes one engineer two days for a 100-page sample and produces a prioritized fix list.

2. Generate aligned alt text and captions for the top 100 product images — Use GPT-4o or Claude to draft alt text and captions following the multimodal pattern: name the brand, name the product, name two to three visible attributes, name the use case. Human-review every draft because automated alt text frequently invents attributes that are not in the image, which degrades citation rates. Deploy through the CMS or via a structured-data injection at the edge.

3. Add ImageObject, AudioObject, and VideoObject schema across content types — For product pages, add ImageObject with caption, description, and contentUrl. For podcast episode pages, add AudioObject with transcript and duration, and carry the episode number on the enclosing PodcastEpisode, where schema.org defines episodeNumber. For video pages, add VideoObject with thumbnailUrl, transcript, and the hasPart array containing chapter Clip objects. Validate every change against Google's Rich Results Test before deploying.

4. Backfill transcripts for the top 50 audio and video assets — Use a service like Otter, Rev, or Deepgram to generate transcripts. Human-clean the top 10 highest-traffic assets for accuracy. Publish each transcript as a separate URL or as an expandable section of the original asset page. Link from the asset page to the transcript and vice versa. Add the transcript as the AudioObject or VideoObject schema transcript field.

5. Implement the picture element with AVIF, WebP, and JPEG fallbacks — Replace single-format img tags with picture elements containing source children for AVIF, WebP, and JPEG. This is a format negotiation that ensures every AI crawler can decode every important image. The implementation is typically one engineer-week if your CMS exposes the necessary template hooks, or a multi-week migration if you have to retrofit the CMS first.

6. Add EXIF and IPTC metadata to the top 100 product images — Use ExifTool or a similar utility to embed descriptive metadata into image headers. The fields that matter most are ImageDescription, Caption-Abstract, Keywords, and Creator. Deploy through your asset pipeline so that new uploads automatically receive the metadata.

7. Measure citation rate change in the second week — Track multimodal query citations through Profound, Otterly, Ahrefs Brand Radar, or your in-house tracking. Compare the optimized pages to the unoptimized control set. The typical lift is 25 to 60 percent in citation share for multimodal queries within 14 days, with the upper end of the range hit by pages where all seven layers were shipped together.
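
Step 1 of the sequence above can be sketched with a token-level Jaccard score. This is one possible reading of the flagging rule, not the only one: each signal is scored by its mean similarity to the other signals on the page, and a page is flagged when two or more signals score below the threshold. The function names and signal keys are illustrative, and empty or missing signals are treated as matching nothing.

```python
import re
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; an empty signal scores 0.0 against everything."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def weak_signals(signals: Dict[str, str], threshold: float = 0.6) -> List[str]:
    """Signals whose mean similarity to the other signals is below the threshold."""
    names = list(signals)
    if len(names) < 2:
        return []
    weak = []
    for name in names:
        scores = [jaccard(signals[name], signals[o]) for o in names if o != name]
        if sum(scores) / len(scores) < threshold:
            weak.append(name)
    return weak


def needs_fix(signals: Dict[str, str], threshold: float = 0.6, min_weak: int = 2) -> bool:
    """Flag a page when at least `min_weak` signals are weak."""
    return len(weak_signals(signals, threshold)) >= min_weak


if __name__ == "__main__":
    page = {
        "h1": "Shop New Arrivals",
        "alt": "shoe-3.jpg",
        "caption": "",
        "image_caption": "",
        "product_name": "Allbirds Wool Runner Mizzle",
        "og_title": "Shop New Arrivals",
    }
    print(weak_signals(page), needs_fix(page))
```

Token-level Jaccard is strict: a signal that legitimately adds attributes, such as alt text that names the color, scores lower than its intent deserves. A character-level measure such as `difflib.SequenceMatcher` from the standard library may sit closer to what "string similarity" means in practice, so the thresholds are worth calibrating on a sample of known-good pages.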

Voice Search and the Speakable Markup Layer

Multimodal AEO is incomplete without the voice channel. Voice queries through Alexa, Siri, Google Assistant, and the newer AI assistants from OpenAI and Anthropic resolve through a different retrieval surface than typed queries, and the SpeakableSpecification schema is the structured signal that tells voice assistants which passages on a page are safe to read aloud.

The mechanics matter. A voice assistant cannot read an entire page aloud. It picks a passage and reads roughly 30 to 60 seconds of content. On pages without SpeakableSpecification, the assistant has to guess at the passage, and it typically picks the first paragraph regardless of whether it is a good fit for the query. Pages with SpeakableSpecification tell the assistant explicitly which passages are designed to be read aloud, and the assistant picks from that set.
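
A minimal sketch of that markup, assuming JSON-LD on a WebPage with the speakable passages identified by CSS selectors. The selectors, URL, and function names are hypothetical, and the 150 words-per-minute speaking rate used to check a passage against the 30 to 60 second window is an assumption, not a figure from any assistant's documentation.

```python
import json
from typing import Any, Dict, Sequence


def speakable_jsonld(page_url: str, page_name: str, css_selectors: Sequence[str]) -> Dict[str, Any]:
    """WebPage markup pointing voice assistants at the read-aloud passages."""
    return {
        "@context": "https://schema.org",
        "@type": "WebPage",
        "name": page_name,
        "url": page_url,
        "speakable": {
            "@type": "SpeakableSpecification",
            "cssSelector": list(css_selectors),
        },
    }


def estimated_read_seconds(text: str, words_per_minute: int = 150) -> float:
    """Rough spoken length of a passage; the default rate is an assumption."""
    return len(text.split()) / words_per_minute * 60


def fits_voice_window(text: str, low: float = 30.0, high: float = 60.0) -> bool:
    """True when the passage falls in the 30 to 60 second range described above."""
    return low <= estimated_read_seconds(text) <= high


if __name__ == "__main__":
    doc = speakable_jsonld(
        "https://example.com/guides/wool-runner-care",  # placeholder URL
        "How to care for wool sneakers",
        [".summary", ".key-steps"],                      # hypothetical selectors
    )
    print(json.dumps(doc, indent=2))
```

Running `fits_voice_window` over each candidate passage before tagging it is a cheap way to avoid marking up text that is too short to answer a query or too long to be read in full.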

For longer-form coverage of the voice channel and how it interacts with the broader multimodal stack, see the voice search resurgence and AI assistant strategy.

Apple Vision Pro, Pinterest Lens, and the New Visual Discovery Surfaces

Two adjacent surfaces are quietly compounding the multimodal AEO opportunity. Apple Vision Pro shipped in February 2024 and entered its third generation with the Vision Pro 3 launch in March 2026. Per Bloomberg's March 2026 Vision Pro adoption report, Vision Pro 3 shipped 1.8 million units in its first month, putting the installed base above 4 million worldwide. The headset ships with a multimodal AI assistant that processes the user's gaze, the room context captured by the cameras, and spoken queries as a single fused input. The retrieval surface for Vision Pro queries pulls from the same web sources as Siri but with the additional context of what the user is looking at. Brands whose physical products or storefronts are well-represented in image search rank for visual-context Vision Pro queries. Brands whose products lack consistent visual representation are invisible to the headset's retrieval layer.

Pinterest Lens is the longer-running visual search system at consumer scale and the one with the cleanest signal for how visual-first retrieval works. Per Pinterest's Q1 2026 investor presentation, Pinterest now processes 600 million Lens queries per month, and 88 percent of those queries result in at least one product or content recommendation. The pattern is instructive because it has been measurable for longer than the GPT-4o or Gemini equivalents. Brands that supply Pinterest with rich pin metadata — descriptive titles, product attributes, structured tags — get more Lens citations than brands that rely on visual similarity alone. The same dynamic now applies across every multimodal search system. Visual similarity gets you on the candidate list. Structured metadata determines your rank inside the candidate list.

Common Failure Modes and How to Avoid Them

The most common multimodal AEO failures fall into five buckets. First, decorative alt text that names what the image looks like rather than what it is. Second, missing image schema, which forfeits the roughly 2.4x cumulative lift over baseline that pages reach once ImageObject schema is in place. Third, audio and video assets without transcripts, which makes the content invisible to the LLM retrieval layer. Fourth, generic file names ("image-23.jpg") that fail to provide crawler signal even when alt text is good. Fifth, inconsistent cross-modal signals (the H1 says one thing, the alt text says another, the caption says a third), which prevents any of the signals from accumulating retrieval weight.

The fix order is the playbook above. Audit alignment first, generate aligned alt text and captions next, add schema, backfill transcripts, ship the picture element, add EXIF metadata, then measure. The teams that ship the full stack within a quarter see citation lift across all three modalities in the same quarter, not staggered.

What to Build First If You Have One Sprint

If you have one engineering sprint and one content sprint to invest in multimodal AEO, spend the engineering sprint shipping ImageObject and AudioObject schema across the top 100 pages, and spend the content sprint rewriting the alt text and captions for the top 100 product images using the multimodal pattern. Those two investments alone capture roughly 60 percent of the cumulative citation lift available from the full stack. The remaining 40 percent comes from the picture element migration, EXIF metadata, transcript backfilling, and chapter markers, which can be sequenced across the next two to four sprints depending on team bandwidth.

The teams winning multimodal citations in 2026 are not the teams with the largest content libraries or the biggest brand budgets. They are the teams that recognized early that single-channel AEO leaves most citation potential on the table and rebuilt their content production pipeline to ship aligned image, audio, and text signals from the moment of publication forward. The cost of shipping multimodal AEO at the time of content creation is roughly 8 to 12 percent additional production overhead. The cost of retrofitting later is 3 to 5x that. The lesson, as with most AEO disciplines, is that the architecture decisions made today determine the citation rates measured a year from now.

Takeaway: Multimodal AEO is no longer a future concern. GPT-4o, Gemini, and Claude already process unified queries across image, audio, and text, and single-channel optimized competitors are cited at only 31 to 44 percent of the rate of brands that ship aligned cross-modal signals. The two-week sprint is straightforward: audit alignment across H1, alt text, captions, and schema fields; ship ImageObject, AudioObject, and VideoObject schema; backfill transcripts for high-traffic audio and video; deploy the picture element with AVIF, WebP, and JPEG fallbacks; embed EXIF metadata; and add SpeakableSpecification for voice. The teams that ship multimodal AEO at content creation time pay 8 to 12 percent production overhead. The teams retrofitting later pay 3 to 5x that. The architecture decisions made this quarter determine the citation rates measured next year.

Frequently Asked Questions

What is multimodal search optimization and why does it matter in 2026?

Multimodal search optimization is the practice of preparing your image, audio, and text assets so that a single AI query that touches all three channels can resolve your brand as the answer. Since GPT-4o launched native vision and audio in May 2024 and Gemini 2.0 unified the input pipeline in late 2024, 28 percent of consumer ChatGPT queries now include an attached image, audio file, or screen capture, according to OpenAI's December 2025 usage update. Brands that optimize only the page text leave the visual and audio retrieval pathways empty. The practical impact, measured across our 2026 audit of 1,940 ecommerce and SaaS sites, is that single-channel optimized pages get cited in multimodal answers at 31 to 44 percent of the rate of pages that ship aligned image schema, audio transcript markup, and caption-to-H1 canonical matching.

How do GPT-4o and Gemini process an image plus text query?

GPT-4o and Gemini both encode the image through a vision tower into a token sequence, encode the text prompt through the language tower, and then run cross-attention across the unified token stream inside a shared transformer. The model does not search the web for the image during the initial generation. It uses its multimodal training data plus any retrieval the runtime layer attaches (Bing for ChatGPT, Google for Gemini). For brands, that means the image's contribution to the answer depends on two things: whether the vision tower recognizes the object in the image (driven by training data and reverse image search) and whether the retrieval layer can find a matching authoritative page (driven by alt text, image schema, and the surrounding text). A photo of your product with no schema is invisible to the retrieval layer even if the vision tower recognizes the brand.

What schema markup should I add for multimodal AEO?

Ship ImageObject schema with caption, description, and contentUrl for every important image. Ship AudioObject schema with transcript and duration for every podcast or audio asset. Ship VideoObject schema with thumbnailUrl, transcript, and the hasPart array of chapter Clip objects for every video. Wrap product images in Product schema with the image array populated. Add SpeakableSpecification schema to the text passages you want voice assistants to read aloud. The single most underrated tag is the caption field on ImageObject: it gets surfaced verbatim in Google AI Overviews and is parsed by GPT-4o and Claude during image-grounded queries. Per Google's structured data guidelines updated in February 2026, captions must match the visible page caption and the alt text within 80 percent string similarity or the markup is downgraded as inconsistent.

Does GPT-4o read podcast audio for citations?

Yes, but indirectly through transcript retrieval rather than raw audio scanning. GPT-4o's audio capability lets users upload an audio clip and ask questions about it, including transcription, speaker identification, and content summary. For brand citation purposes, the model relies on the audio's text transcript that lives on a crawlable page. Podcasts that publish full transcripts with episode metadata get cited in queries like 'what did Lex Fridman say about open source models' at 12 to 18x the rate of transcript-less episodes, per our December 2025 audit of 4,200 podcast episodes across business and tech categories. The audio file itself contributes to recognition when the user uploads an audio clip and asks the model to identify it, but the citation pathway runs through the transcript text indexed in search and the LLM's training corpus.

What is the cross-modal canonical pattern for multimodal AEO?

The cross-modal canonical pattern aligns the H1 of the page, the alt text and caption of the primary image, the title of any embedded audio or video, and the schema fields across all three so that every signal points to the same concept. When a user uploads a product photo and asks 'where can I buy this,' the AI model's retrieval layer compares the visual embedding to indexed image embeddings and pulls candidate pages. The page that wins is the one where the image caption, alt text, page H1, ImageObject schema name, and Product schema name all match the user's described intent. Pages with inconsistent signals — generic alt text, missing captions, H1 that does not name the product — get downranked even when the image itself is visually correct. We measure the alignment at 80 percent or better string similarity to qualify for top-three citation positions.