The Great AI Inference Migration: Why Every Company Is Switching Models Every 90 Days

Model switching costs dropped to near zero. 68% of enterprises now use three or more LLM providers. Average model tenure is 87 days and shrinking. The model layer is commoditizing faster than anyone predicted, and the real lock-in is moving to the orchestration layer that sits above it.

By Raj Patel, AI & Infrastructure · Mar 10, 2026 · 14 min read

In January 2026, the infrastructure team at a Fortune 500 financial services firm completed a migration from GPT-4o to Claude 3.5 Sonnet across 14 production applications. The migration took 11 hours. Nine months earlier, a similar migration from GPT-4 to GPT-4o had taken the same team six weeks. The difference was not engineering skill. It was that standardized API formats, model routing layers, and abstraction libraries had reduced the switching cost from a major infrastructure project to a configuration change.

That firm is not unusual. According to Flexera's 2026 State of AI Infrastructure report, 68% of enterprises now use three or more LLM providers in production. Forty-one percent maintain active contracts with five or more. The average tenure of a primary model, the LLM handling the majority of an organization's inference volume, has dropped to 87 days, down from roughly 14 months in early 2024.

The AI industry spent 2023 and 2024 debating which model would win. The answer, increasingly clear in 2026, is that no model wins permanently. The model layer is commoditizing at a speed that makes even cloud computing's commoditization look gradual. And the implications for pricing, market structure, and where value accrues in the AI stack are enormous.

The Switching Cost Collapse

To understand why model migration accelerated so dramatically, you need to trace three simultaneous developments that converged in late 2025.

First, API standardization. When OpenAI released the ChatCompletions API format in March 2023, it became the de facto standard, not because it was technically superior, but because it was first and developers built around it. By mid-2025, every major model provider, Anthropic, Google, Mistral, Cohere, and every significant open-source inference platform, offered an OpenAI-compatible API endpoint. Together AI, Fireworks AI, Groq, and Replicate all adopted the same request and response format for hosted open-source models.

This convergence was not accidental. Model providers realized that requiring developers to learn a proprietary API format was a friction point that cost them adoption. Anthropic's decision to offer an OpenAI-compatible mode alongside its native API in August 2025 was the symbolic tipping point. When even the company with the most technically differentiated API chose compatibility over lock-in, the standardization war was over.

The practical effect: a developer can swap model: "gpt-4o" for model: "claude-3-5-sonnet-20250815" in a single line of code and, for most use cases, get a working application with zero other changes. That is a switching cost of approximately zero.

Second, abstraction libraries. Tools like LiteLLM (22,000+ GitHub stars), the OpenAI Python SDK, and various provider SDKs made multi-model support a configuration issue rather than an engineering project. LiteLLM provides a single interface to over 100 LLM providers. A team using LiteLLM can add a new model provider with a single environment variable.

Third, the routing layer. Platforms like OpenRouter, Portkey, Martian, and Unify went a step further than abstraction libraries. They not only normalized the API interface but added intelligent routing: automatically directing each request to the optimal model based on cost, latency, quality scores, and availability. OpenRouter now processes over 3 billion tokens per day across 200+ models. That volume represents a meaningful share of global LLM inference traffic flowing through a single routing layer.

The combined result of these three forces is that model switching costs have dropped from weeks of engineering effort in 2023 to hours or minutes in 2026. And when switching costs approach zero, loyalty evaporates.

The 87-Day Model Tenure

The data on model churn is striking. We compiled model adoption timelines from [a]16z's AI infrastructure survey](https://a16z.com/ai-infrastructure-survey-2026/), Portkey's anonymized routing data, and public procurement records from USAspending.gov to construct a timeline of enterprise model adoption.

Period	Dominant Model	Avg. Enterprise Tenure	Key Displacement Event
Q1 2023 – Q3 2023	GPT-4	18 months	No meaningful competitor
Q4 2023 – Q2 2024	GPT-4 Turbo	8 months	Claude 2.1 eroded share at margins
Q3 2024 – Q4 2024	Claude 3.5 Sonnet	5 months	Benchmark leadership + lower cost
Q1 2025 – Q2 2025	GPT-4o	4 months	Multimodal + price cuts
Q3 2025 – Q4 2025	Claude 3.5 Sonnet (v2)	3.5 months	Extended thinking, code quality
Q1 2026 – present	Multi-model (no single dominant)	N/A	Routing layers enable continuous rebalancing

The pattern is unmistakable. Each generation of models had a shorter reign than the last. And the Q1 2026 row is the most significant: for the first time, there is no single dominant model across enterprise deployments. Instead, companies are running a diversified portfolio, routing different workloads to different models based on the specific cost-quality-latency tradeoff each task requires.

Portkey's 2026 Model Usage Report confirms this fragmentation. Among their enterprise customers:

34% of inference traffic goes to OpenAI models (down from 71% in January 2025)
28% goes to Anthropic models (up from 14%)
19% goes to Google Gemini models (up from 6%)
11% goes to open-source models via hosted providers (up from 4%)
8% goes to specialized or regional models (DeepSeek, Mistral, Qwen)

No single provider commands majority share. This is a structural shift, not a temporary fluctuation.

The Economics: Why No Model Has Durable Pricing Power

The pricing trajectory of frontier AI models tells the commoditization story in dollar terms. Here is what $1 million in inference spend bought you at each point in time, normalized to GPT-4-equivalent quality output:

Date	Model	Input Cost (per 1M tokens)	Output Cost (per 1M tokens)	Effective $/quality unit
Mar 2023	GPT-4	$30.00	$60.00	$1.00 (baseline)
Nov 2023	GPT-4 Turbo	$10.00	$30.00	$0.44
Jun 2024	Claude 3.5 Sonnet	$3.00	$15.00	$0.20
May 2024	GPT-4o	$5.00	$15.00	$0.22
Dec 2024	Gemini 1.5 Pro	$1.25	$5.00	$0.07
Jan 2025	DeepSeek V3	$0.27	$1.10	$0.015
Feb 2025	GPT-4o mini	$0.15	$0.60	$0.008
Mar 2026	Llama 4 (self-hosted)	$0.05	$0.05	$0.001

In three years, the cost of GPT-4-equivalent inference fell by approximately 1,000x. This is not a gradual decline. It is a price collapse.

The mechanism driving the collapse is the same one that drove cloud compute prices down in 2010-2018: a combination of hardware improvements (Nvidia's Blackwell architecture delivers roughly 4x the inference throughput per dollar of Hopper), software optimization (quantization, speculative decoding, continuous batching), and competitive pressure from open-source alternatives that establish a price floor near marginal cost.

DeepSeek's V3 model, released in January 2025, was the single most disruptive pricing event. A Chinese lab trained a model competitive with GPT-4o at a reported cost of $5.6 million, a fraction of what OpenAI, Anthropic, or Google spent on their frontier models. Then DeepSeek offered API access at prices 10-20x below Western competitors. This forced an industry-wide repricing. OpenAI cut GPT-4o mini prices by 60% within three months. Anthropic introduced Haiku at aggressive price points. Google slashed Gemini 1.5 Pro pricing twice.

The lesson was clear: when a credible open-source alternative can replicate 90% of a frontier model's capability at 5% of the cost, the proprietary premium collapses. And open-source models are now reaching that threshold within 3-6 months of each proprietary release, down from 12-18 months in 2023.

The Model Arbitrage Strategy

The rational response to a market with falling prices, converging quality, and near-zero switching costs is not to pick a winner. It is to arbitrage the entire market continuously.

Model arbitrage is the practice of routing each inference request to the cheapest model that meets a minimum quality threshold for that specific task. It is already the default strategy among sophisticated AI engineering teams, and it is rapidly spreading to mainstream enterprise deployments.

The mechanics work like this. A company defines a taxonomy of inference tasks, typically 5-15 categories spanning their applications. For each category, they establish a quality threshold based on automated evaluation (using benchmarks, human preference scores, or task-specific metrics). Then a routing layer, either built in-house or provided by a platform like Martian or Unify, directs each request to the cheapest model that clears the quality bar for that category.

Here is what a typical routing configuration looks like for an enterprise SaaS company:

Task Category	Quality Threshold	Routed Model	Cost per 1M tokens (blended)	% of Total Traffic
Simple classification / tagging	Low	GPT-4o mini	$0.38	35%
Content summarization	Medium-low	Gemini 1.5 Flash	$0.35	18%
RAG / document Q&A	Medium	Claude 3.5 Haiku	$0.80	22%
Code generation	High	Claude 3.5 Sonnet	$9.00	12%
Complex reasoning / analysis	Very high	GPT-4o / Claude Opus	$22.50	8%
Creative writing / marketing	Medium-high	Claude 3.5 Sonnet	$9.00	5%

The weighted average cost across this portfolio is approximately $2.40 per million tokens. If the same company routed everything through a single frontier model, the cost would be $15-$22 per million tokens. The arbitrage saves 84-89% on inference costs.

Martian's production data shows that 62% of enterprise queries can be handled by models costing less than $1 per million input tokens. Only 8-12% of queries genuinely require frontier-model capability. The remaining 26-30% sit in a middle tier where mid-range models deliver adequate quality.

The implication for model providers is severe. If the majority of inference volume flows to the cheapest adequate model, then the premium a frontier model can charge is limited to the 8-12% of queries where it has no substitute. For the other 88-92% of traffic, the model layer is a commodity market where the lowest bidder wins.

The New Lock-In: Orchestration and Routing Layers

If switching between models is trivial, then model providers lose lock-in. But lock-in does not disappear. It migrates up the stack to the orchestration and routing layers that manage multi-model deployments.

Consider what happens when a company adopts a platform like OpenRouter or Portkey. Initially, it is a simple proxy: route requests to model A or model B based on a flag. Over time, the integration deepens:

Routing rules encode business logic about which models handle which tasks
Fallback chains define what happens when a primary model is down or rate-limited
Cost budgets enforce per-team or per-application spending limits
Caching layers store frequently accessed responses to reduce redundant inference
Observability hooks feed latency, cost, and quality metrics into dashboards
Prompt management systems version and deploy prompts optimized for specific models
Compliance filters apply organization-specific content policies across all models

Each of these features adds value. Each also adds a dependency that makes migrating away from the routing platform progressively harder. A company that has spent six months building routing rules, fallback chains, and compliance configurations in Portkey faces a significant migration cost to switch to OpenRouter, even if switching between the underlying models remains trivial.

This is the irony of the multi-model era: the tools that liberate companies from model lock-in are themselves becoming the new lock-in point.

The data supports this pattern. OpenRouter's public metrics show daily active developers growing from approximately 12,000 in January 2025 to over 85,000 in March 2026, a 7x increase. LiteLLM's GitHub repository has gone from 8,000 to 22,000 stars in the same period. Portkey raised a $23 million Series A in November 2025 and reports processing over $50 million in annualized model inference spend through its gateway.

The routing layer companies are small today. But they sit at a chokepoint in the AI stack. Every token that flows through their infrastructure generates routing data, cost data, quality data, and latency data that can be used to build better routing algorithms, creating a data flywheel that reinforces their position.

The Cloud Computing Parallel

The historical parallel to cloud computing is almost too clean.

In 2008-2012, enterprises debated whether to go all-in on AWS or build private clouds. Amazon had a massive head start, a standardized API (S3, EC2), and aggressive pricing. The consensus was that AWS would dominate indefinitely.

Then two things happened simultaneously. First, competitors (Azure, GCP) achieved capability parity on most workloads. Second, multi-cloud abstraction layers (Terraform, Kubernetes, CloudFormation) made it possible to deploy across providers without rewriting applications. By 2018, Flexera's annual cloud survey showed 81% of enterprises using a multi-cloud strategy.

AWS maintained its lead in absolute market share. But its pricing power eroded. Cloud compute prices fell roughly 10-15% per year through the 2010s. AWS's operating margins stabilized rather than expanded. The commoditization of infrastructure drove value to the application layer, where Snowflake, Datadog, and Confluent built sticky platforms on top of commodity cloud resources.

The AI model market is following the same trajectory, compressed into about one-third the time:

Cloud Computing (2008-2020)	AI Models (2023-2026)	Timeline Compression
AWS dominates with 65% share	OpenAI dominates with 70%+ API share	—
Azure, GCP reach parity	Claude, Gemini reach parity	3 years vs. 8 years
S3/EC2 API becomes standard	OpenAI ChatCompletions format becomes standard	2 years vs. 6 years
Multi-cloud becomes default (81%)	Multi-model becomes default (68%)	2.5 years vs. 10 years
Terraform/K8s enable portability	LiteLLM/OpenRouter enable portability	2 years vs. 5 years
Cloud prices fall 10-15%/year	Model prices fall 60-80%/year	4-8x faster
Application layer captures value	Application/orchestration layer captures value	Emerging

The compression factor is approximately 3x. What took cloud computing a decade is happening in the AI model market in three to four years. The reason is that software abstractions (API compatibility, routing layers) are faster to build and adopt than infrastructure abstractions (containerization, orchestration platforms).

There is, however, one critical difference. In cloud computing, the underlying infrastructure (data centers, servers, networking) had massive capital requirements that naturally limited the number of credible competitors. In AI models, the training cost for frontier models is high ($100M-$1B+), but inference serving can be done by anyone with GPU access and an API endpoint. This means the competitive field for AI inference is far larger than the competitive field for cloud infrastructure, which implies even faster commoditization.

Who Benefits: The Application Layer Thesis

If the model layer is commoditizing, where does value accrue?

The answer, supported by both theory and evidence, is the application layer: companies that build workflow-specific software on top of interchangeable models, creating lock-in through data, integrations, and user habits rather than through proprietary model capabilities.

Consider the following companies, all of which are model-agnostic and have explicitly designed their products to swap underlying models:

Company	Product	Revenue (ARR)	Model Strategy	Lock-In Source
Cursor	AI code editor	$2B+	Uses Claude, GPT-4o, Gemini	Workspace state, keybindings, tab completion model
Jasper	AI marketing content	$350M+	Routes across 5+ models	Brand voice profiles, campaign templates, team workflows
Harvey	AI legal assistant	$200M+	Multi-model, task-dependent	Legal document corpus, firm-specific training data
Glean	Enterprise AI search	$150M+	Model-agnostic RAG	Enterprise knowledge graph, permissions, connectors
Intercom	AI support (Fin)	$100M+ (AI revenue)	Swaps models per release	Conversation history, resolution workflows, training data

None of these companies are locked into a single model provider. Cursor shifted from primarily GPT-4 to primarily Claude between 2024 and 2025 with minimal user-facing disruption. Jasper has publicly stated it routes content generation across multiple models based on the task. Harvey uses different models for different legal reasoning tasks.

Their lock-in comes from the application layer: data they accumulate (Glean's enterprise knowledge graphs, Harvey's legal document corpus), workflows they embed in (Cursor's editor state, Intercom's support queue), and switching costs they create through integration depth rather than model dependency.

This is the strongest argument that the model layer will become, like cloud compute, a necessary but low-margin input to the real value creation happening above it.

Enterprise Multi-Model Strategies: The Bake-Off Economy

The shift to multi-model has fundamentally changed how enterprises procure AI. The era of a single, long-term model contract is ending. In its place is what procurement teams now call the "model bake-off" process: a structured, recurring evaluation where multiple models are tested against production workloads and scored on a standardized rubric.

McKinsey's March 2026 enterprise AI survey found that 73% of companies with over $1 billion in revenue now run formal model evaluations at least quarterly. Thirty-one percent evaluate monthly or continuously. The bake-off process typically follows a standardized pattern:

Phase 1: Benchmark suite (Days 1-3). The AI platform team runs a standardized benchmark suite of 500-2,000 test cases drawn from actual production queries. Models are scored on accuracy, latency, cost, and consistency. This phase eliminates models that do not meet baseline requirements.

Phase 2: Shadow deployment (Days 4-14). Top-scoring models are deployed in shadow mode alongside the current production model. Real traffic is duplicated to the candidate model, and responses are compared using automated evaluation frameworks (LLM-as-judge, reference matching, human spot-checks). This phase reveals performance differences that benchmarks miss.

Phase 3: Staged rollout (Days 15-30). The winning model is rolled out to 10%, then 25%, then 50%, then 100% of production traffic, with automated monitoring for quality regressions. If quality drops below thresholds at any stage, traffic reverts automatically.

Phase 4: Contract negotiation (Ongoing). Armed with competitive benchmark data, procurement teams negotiate pricing with the selected provider, using the demonstrated viability of alternatives as leverage.

This process has profoundly changed the negotiating dynamic between model providers and enterprise customers. When a procurement team can show that Claude Sonnet scores within 2% of GPT-4o on their specific workload at 40% lower cost, OpenAI's ability to maintain premium pricing is severely constrained.

The bake-off economy also explains why model providers are investing heavily in non-model features: enterprise compliance certifications (SOC 2, HIPAA, FedRAMP), fine-tuning infrastructure, dedicated capacity, and SLA guarantees. These features create switching costs that the model itself no longer provides.

The Pricing War: A Provider-by-Provider Analysis

Each major model provider is responding to commoditization pressures differently. Here is where pricing and strategy stood as of March 2026:

OpenAI

OpenAI has cut prices more aggressively than any competitor, reducing GPT-4o input pricing from $5/1M tokens at launch to $2.50/1M tokens by March 2026, with volume discounts pushing effective pricing below $1.50/1M tokens for large customers. GPT-4o mini, launched at $0.15/1M input tokens, has become the workhorse model for cost-sensitive workloads and now accounts for an estimated 60% of OpenAI's API inference volume by token count.

But OpenAI's strategy is not to win on price. It is to win on platform. ChatGPT Enterprise, custom GPTs, the Assistants API with file search and code interpreter, and the recently launched Operator agentic framework are all designed to create workflow lock-in that persists regardless of which underlying model a customer uses. OpenAI's bet is that the model becomes a feature of the platform, not the product itself.

Revenue data supports the approach. OpenAI's annualized revenue reportedly crossed $11.6 billion in early 2026, with ChatGPT subscriptions (consumer and enterprise) accounting for roughly 55% and API revenue accounting for 45%. The subscription revenue carries higher margins and lower churn than API revenue, which is increasingly price-competitive.

Anthropic

Anthropic's strategy centers on differentiation through reliability, safety, and enterprise trust. Claude's positioning as the model enterprises choose for regulated industries, sensitive data processing, and high-stakes reasoning has allowed Anthropic to maintain higher per-token pricing than competitors for its frontier models while growing market share.

Claude 3.5 Sonnet's success in the coding segment, where it has become the default model for AI coding tools including Cursor, Windsurf, and Cline, demonstrates the strategy. Developers pay a premium for Claude's code quality and instruction-following precision, and the workflow lock-in comes from the coding tools built around it.

Anthropic's annualized revenue reportedly reached $3.6 billion by Q1 2026, growing faster than OpenAI in percentage terms. The company has avoided aggressive price-cutting on frontier models, instead introducing Haiku variants to compete on cost at the lower end while keeping Sonnet and Opus pricing relatively stable.

Google (Gemini)

Google's approach is the most aggressive on pricing because Google can afford to treat models as a loss leader. Gemini 1.5 Pro pricing at $1.25/1M input tokens for the standard tier undercuts both OpenAI and Anthropic by 50-70% for comparable quality. The 1M-token context window, offered at a fraction of competitors' pricing for long-context tasks, is a unique advantage that no other provider has matched economically.

The strategy is straightforward: use Gemini to drive adoption of Google Cloud Platform, Google Workspace AI features, and the broader Google ecosystem. Model revenue does not need to be profitable if it drives $10-20 in incremental platform revenue for every $1 in model API revenue.

Google Cloud's AI revenue reportedly grew 80% year-over-year in 2025, though the company does not break out Gemini API revenue specifically. The bundling strategy makes it difficult for competitors to match Google's effective pricing without similar platform economics.

Open Source (Llama, DeepSeek, Qwen, Mistral)

The open-source model ecosystem is the ultimate price pressure mechanism. Meta's Llama 4, released in February 2026, matches or exceeds GPT-4o on most standard benchmarks. When self-hosted on commodity GPU infrastructure, inference costs for Llama 4 run approximately $0.05 per million tokens for both input and output, essentially 99.8% cheaper than GPT-4 was at launch three years ago.

DeepSeek's V3 and reasoning-focused R1 models have been particularly disruptive because they come from a Chinese lab operating on fundamentally different economics. DeepSeek's reported training budget of $5.6 million for V3 is orders of magnitude below what Western labs spend, challenging the assumption that frontier model development requires billions in capital.

The open-source tier establishes a price floor for the entire market. No proprietary model provider can charge more than 5-10x the open-source self-hosting cost for comparable quality without losing volume to hosted open-source alternatives. This ceiling is falling as open-source quality converges with proprietary models.

The Hidden Switching Cost: Prompt Engineering

While API compatibility has reduced the technical switching cost to near zero, one significant switching cost remains: prompt engineering.

A company that has spent three months optimizing prompts for GPT-4o, developing system prompts, few-shot examples, chain-of-thought templates, and output formatting instructions, will find that those same prompts produce subtly different results on Claude Sonnet or Gemini Pro. The differences are often small: a slightly different JSON structure, different verbosity, different handling of edge cases. But in production systems where downstream processing depends on consistent output formats, these differences can cause failures.

Braintrust's 2026 developer survey found that 58% of engineering teams cite prompt adaptation as the largest time investment when switching models. The average time to adapt a production prompt suite for a new model is 3-5 days of engineering effort, not the hours that API-level switching requires.

This is why prompt management and evaluation platforms, tools like Braintrust, Humanloop, and PromptLayer, are growing rapidly. They version-control prompts, run automated evaluations across multiple models, and maintain model-specific prompt variants that can be deployed instantly. A team using these platforms can maintain optimized prompts for three or four models simultaneously, enabling instant switching when routing logic or pricing changes warrant it.

The prompt portability problem is also driving a subtle convergence in model behavior. Model providers are increasingly training their models to respond consistently to common prompting patterns, including patterns originally developed for competitors. Claude has become better at following prompts written for GPT-4, and vice versa. This behavioral convergence further reduces switching costs over time.

The Data Moat Question

If models are commoditizing and switching costs are falling, is there any durable moat at the model layer?

The strongest candidates are:

Proprietary training data. Models trained on unique, high-quality datasets that competitors cannot access may maintain persistent quality advantages on specific tasks. This is more likely for domain-specific models (legal, medical, financial) than general-purpose models.

Inference speed and infrastructure. Groq's LPU architecture demonstrates that inference hardware innovation can create meaningful differentiation. If a provider can serve the same model quality at 10x the speed, latency-sensitive applications will route traffic there even at a premium.

Fine-tuning ecosystems. A model provider that makes it easy to fine-tune on proprietary data, and offers the resulting model with competitive inference economics, can create lock-in through the customer's investment in fine-tuning. OpenAI's fine-tuning platform and Anthropic's custom model partnerships are both targeting this vector.

Safety and compliance certifications. For regulated industries, the compliance infrastructure around a model (SOC 2 Type II, HIPAA BAA, FedRAMP authorization) represents a multi-month, multi-million-dollar investment that does not transfer between providers. This creates genuine switching costs for healthcare, financial services, and government customers.

None of these moats is as strong as the model quality advantage that OpenAI enjoyed in 2023. But they are real, and they explain why model providers are investing heavily in non-model capabilities.

Implications for the Market Structure

The commoditization thesis leads to a specific market structure prediction. Within 18-24 months:

The model layer becomes an oligopoly with low margins. Three to five major providers (OpenAI, Anthropic, Google, Meta/open-source, and possibly a Chinese provider like DeepSeek) will serve the vast majority of inference volume. Pricing will converge toward marginal cost plus a modest premium for reliability and compliance. This is the cloud computing analog: AWS, Azure, and GCP all offer nearly identical compute at similar prices.

The orchestration layer becomes a bottleneck. Routing and orchestration platforms will consolidate around two to three winners, similar to how Kubernetes won container orchestration. The winner will be determined by developer adoption and ecosystem breadth, not by technical superiority. OpenRouter and LiteLLM are currently the frontrunners, but the market is early enough that the outcome is uncertain.

The application layer captures the most value. Companies that build specific, valuable workflows on top of commodity models, and create lock-in through data, integrations, and user habits, will capture the majority of the economic value in the AI stack. This is the Snowflake/Datadog pattern: build a sticky application on top of commodity infrastructure.

Enterprise procurement becomes permanently adversarial. The bake-off economy will not revert to single-vendor contracts. Procurement teams have discovered that model competition gives them leverage, and they will maintain multi-model strategies specifically to preserve that leverage, even if a single model is slightly better across all dimensions.

What This Means for Investors

The investment implications of model commoditization are directional and significant.

Underweight: Pure model providers without platform lock-in. Companies whose primary revenue comes from per-token API pricing face persistent margin pressure as each new model generation delivers better quality at lower prices. This includes providers that depend on being the "best model" for their market position, because the window of superiority for each model generation is shrinking from years to months.

Overweight: Application layer companies with workflow lock-in. Companies that use models as inputs to specific, valuable workflows, and create switching costs through data, integrations, and user habits rather than model dependency, are best positioned. Look for companies with model-agnostic architectures that can swap providers without disrupting users.

Watch: Orchestration layer companies at inflection. The routing and orchestration layer is in its early innings. If a company like OpenRouter or Portkey captures a dominant position in model routing, it could become the Cloudflare of AI inference: a critical chokepoint that processes a significant share of global AI traffic and monetizes through routing optimization, caching, and value-added services.

Avoid: Undifferentiated model hosting. Companies that simply offer model inference without unique infrastructure (custom hardware like Groq), unique models (fine-tuned verticals), or unique platform features (routing, observability, compliance) face the most acute pricing pressure. The market for commodity model hosting will likely consolidate to two to three large players plus the hyperscalers.

The 2027 Outlook

If current trends continue, the AI inference market in 2027 will look structurally similar to the cloud computing market in 2018:

Three to four major providers offering comparable capabilities at similar prices
Multi-provider strategies as the overwhelming default (80%+ of enterprises)
An established orchestration layer that enables seamless portability
Value concentrated in the application layer above the infrastructure
Continuous price declines of 30-50% per year at the model layer
Persistent differentiation only at the extreme frontier and in compliance/trust

The companies building for this future, designing model-agnostic architectures, investing in orchestration layers, and creating lock-in through workflow and data rather than model dependency, will outperform those betting on a single model maintaining its advantage.

The great AI inference migration is not a one-time event. It is a permanent condition. The companies that thrive will be those that architect for continuous model change rather than model stability. In a world where the best model changes every 90 days, the only durable advantage is the ability to switch.

Frequently Asked Questions

Why are enterprises switching AI models so frequently?

Enterprises are switching primary LLM providers approximately every 87 days because the combination of standardized APIs, commoditized inference pricing, and rapid model quality convergence has eliminated meaningful switching costs. OpenAI-compatible API formats are now supported by virtually every model provider, meaning a migration that once required weeks of engineering can be completed in hours. Meanwhile, new model releases from Anthropic, Google, Meta, and DeepSeek arrive every 6-10 weeks, each offering better performance-per-dollar ratios than its predecessor. According to Flexera's 2026 State of AI report, 68% of enterprises now use three or more LLM providers simultaneously, and 41% maintain active contracts with five or more. The rational strategy is no longer to pick a winner but to continuously route traffic to the best available model for each task.

What are model routing and orchestration layers, and why do they matter?

Model routing and orchestration layers are software platforms that sit between an application and multiple LLM providers, automatically directing each inference request to the optimal model based on cost, latency, quality, and availability. Key players include OpenRouter, LiteLLM, Portkey, Martian, and Unify. These platforms matter because they are becoming the new lock-in point in the AI stack. While switching between GPT-4o and Claude Sonnet is now trivial at the API level, migrating away from an orchestration layer that handles routing logic, fallback chains, cost optimization, rate limit management, and observability is far more difficult. OpenRouter processes over 3 billion tokens per day across 200+ models. LiteLLM has 22,000+ GitHub stars and is embedded in thousands of production applications. The orchestration layer is capturing the durable value that model providers are losing.

How much can companies save with model arbitrage strategies?

Model arbitrage, the practice of routing each query to the cheapest model that meets a quality threshold, can reduce inference costs by 40-72% without measurable quality degradation for most workloads. A typical enterprise strategy routes simple classification and extraction tasks to lightweight models like GPT-4o mini or Claude Haiku at $0.25-$0.80 per million tokens, medium-complexity reasoning to mid-tier models like Claude Sonnet or Gemini 1.5 Pro at $3-$15 per million tokens, and only escalates complex multi-step reasoning to frontier models like GPT-4o, Claude Opus, or Gemini Ultra at $15-$75 per million tokens. Martian's production data shows that 62% of enterprise queries can be handled by models costing less than $1 per million input tokens. The remaining 38% require mid-tier or frontier models but only account for 15-20% of total query volume by count.

Is the AI model layer really commoditizing like cloud compute did?

The structural parallels to cloud computing commoditization are strong but imperfect. Like cloud compute in 2010-2015, AI models are converging on standardized interfaces (the OpenAI API format is the equivalent of the S3 API), pricing is falling 10-15x per year, and multi-provider strategies are becoming the default. However, unlike cloud compute, model capabilities still differ meaningfully at the frontier. Claude Opus outperforms competitors on extended reasoning and code generation, GPT-4o leads on certain multimodal tasks, and Gemini has advantages in long-context processing. The commoditization is happening fastest at the lower and mid tiers, where open-source models like Llama 4 and DeepSeek V3 have reached quality parity with proprietary alternatives from 12 months ago. At the frontier, differentiation still exists but the window is narrowing to 3-6 months rather than the 12-18 months it was in 2023.

How are OpenAI, Anthropic, and Google responding to model commoditization?

Each major provider is pursuing a different strategy to maintain pricing power as the model layer commoditizes. OpenAI is moving aggressively into the application layer with ChatGPT Enterprise, custom GPTs, and platform features like memory and file storage that create workflow lock-in beyond the model itself. Anthropic is emphasizing safety, reliability, and enterprise compliance, positioning Claude as the model procurement teams choose when risk tolerance is low. Google is leveraging vertical integration, bundling Gemini with Google Cloud, Workspace, and its advertising stack to make the model a loss leader that drives platform revenue. All three have cut prices by 60-85% over the past 18 months, with GPT-4o-level capability now available at roughly 1/10th the price OpenAI charged for GPT-4 at its March 2023 launch. The price war is accelerating as open-source models close the quality gap.

What should enterprise AI teams do to prepare for a multi-model world?

Enterprise AI teams should implement four structural changes. First, adopt a model-agnostic abstraction layer from day one. Whether using OpenRouter, LiteLLM, Portkey, or a custom gateway, every LLM call should pass through a routing layer that decouples application logic from any specific provider. Second, establish a continuous model evaluation pipeline that benchmarks new releases against production workloads within 48 hours of launch. Companies running quarterly evaluations are already falling behind. Third, negotiate contracts that reflect the new reality: shorter terms (6-12 months maximum), volume-based pricing with no minimums, and explicit provisions for multi-provider deployments. Fourth, invest in prompt portability. The biggest hidden switching cost is not the API integration but the prompt engineering. Teams that structure prompts as data, version-controlled and model-parameterized, can migrate between providers in hours rather than weeks.