The PLG Reset: When Your First 'User' Is an AI Agent
From $15 to $1.10 per million input tokens in under two years: the economics behind the inference race, what it means for your AI spend, and the margin math every SaaS team needs to run now.
In late 2024, running one million tokens through OpenAI's o1 model, then the state-of-the-art reasoning system, cost $15 on input and $60 on output. By mid-2026, you can run equivalent or superior reasoning through o3 for $2 input and $8 output, and through o4-mini for $1.10 input and $4.40 output. That is a price reduction of 87% to 93% in roughly 18 months for frontier-grade capability. According to OpenAI's published API pricing, o4-mini is now about 13.6 times cheaper than o1 per input token.
This is not a normal software pricing trend. Cloud compute prices fell 30-40% over the decade following AWS's launch. AI inference pricing just moved 90% in 18 months. The speed of the compression changes its strategic meaning: this is not a gradual shift that teams can absorb by updating models periodically. It is a structural re-pricing of AI capability that is already inside your COGS, your competitive dynamics, and your vendor contracts—whether you have noticed or not.
Signal's 2026 Enterprise AI Model Scorecard documents the current generation's benchmark performance in detail. This piece is about the other half of that picture: what the price trajectory means for the economic decisions that compound from it.
Where Prices Actually Stand in Mid-2026
The table below reflects current list pricing across the three major frontier AI providers for their principal model tiers. These are standard API rates without negotiated enterprise discounts, prompt caching, or batch processing.
| Model | Provider | Input ($/1M tokens) | Output ($/1M tokens) | Input vs. o1 2024 |
|---|---|---|---|---|
| o1 (late 2024, reference) | OpenAI | $15.00 | $60.00 | baseline |
| GPT-4o | OpenAI | $2.50 | $10.00 | -83% |
| o3 | OpenAI | $2.00 | $8.00 | -87% |
| o4-mini | OpenAI | $1.10 | $4.40 | -93% |
| Claude Opus 4.8 | Anthropic | $5.00 | $25.00 | -67% |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | -80% |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | -93% |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 | -92% |
| Gemini 2.5 Flash | Google | $0.30 | $2.50 | -98% |
| Gemini 2.5 Flash-Lite | Google | $0.10 | $0.40 | -99% |
The compression is consistent across all three major providers and every tier within each provider's lineup. The cheapest widely-available frontier option, Gemini 2.5 Flash-Lite, costs $0.10 per million input tokens—a price point that was not available for any capable model anywhere in 2024. Anthropic's current pricing tells a particularly dramatic story: Opus 4.1 launched at $15/$75 input/output, and within 18 months the equivalent top-tier Opus product had been cut 67%.
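The last column of the table can be reproduced with a few lines of arithmetic. The sketch below assumes the input prices are copied correctly from the table and that each percentage is rounded to the nearest whole number; the function and variable names are illustrative only.

```python
# Input list prices in USD per million tokens, copied from the table above.
BASELINE_O1_INPUT = 15.00

INPUT_PRICES = {
    "GPT-4o": 2.50,
    "o3": 2.00,
    "o4-mini": 1.10,
    "Claude Opus 4.8": 5.00,
    "Claude Sonnet 4.6": 3.00,
    "Claude Haiku 4.5": 1.00,
    "Gemini 2.5 Pro": 1.25,
    "Gemini 2.5 Flash": 0.30,
    "Gemini 2.5 Flash-Lite": 0.10,
}


def pct_change_vs_baseline(price: float, baseline: float = BASELINE_O1_INPUT) -> int:
    """Percentage change relative to the baseline, rounded to a whole percent."""
    return round((price / baseline - 1.0) * 100)


if __name__ == "__main__":
    for model, price in INPUT_PRICES.items():
        print(f"{model:<24} ${price:>5.2f}  {pct_change_vs_baseline(price):+d}%")
```

Rerunning this whenever a provider updates its list prices keeps the comparison column honest without manual recalculation.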
This compression is the result of two simultaneous forces: hardware efficiency improvements in inference infrastructure (custom ASICs, optimized kernels, speculative decoding) and competitive pressure from an increasingly crowded provider market. Neither force shows signs of reversing. Google's Tensor Processing Unit fourth generation and NVIDIA's Blackwell architecture each offer another step-function in inference efficiency. The Cerebras Wafer Scale Engine has reshaped what is possible at the extreme of inference throughput—a development Signal covered in depth when Cerebras IPO'd in May 2026.
The Three Phases of the Price War
Understanding what drives the compression matters for forecasting where prices go next. The AI pricing race has moved through three identifiable phases.
Phase 1: Training cost reduction (2022-2024). The first wave of price compression came from more efficient training runs. Better architectures, mixture-of-experts designs, and improved data quality meant that each new model generation achieved higher benchmark performance per dollar of training compute. The direct consumer of these gains was model capability; pricing remained high because inference infrastructure had not caught up.
Phase 2: Inference at launch (2024-early 2026). As models moved from laboratory to production deployment, inference infrastructure became the constraint. Providers deployed initial capacity expensively—reserved GPUs, early ASICs, relatively inefficient serving stacks. Prices in this phase tracked infrastructure cost plus margin, with the innovation lag keeping prices above the long-run equilibrium.
Phase 3: Inference at scale (mid-2026 onward). The phase we are now entering is characterized by specialized inference infrastructure deployed at scale. Custom ASICs specifically designed for transformer inference (not just training), speculative decoding pipelines that reduce output generation latency by 40-60%, and batching efficiency improvements that let providers serve more users per GPU all compound. The marginal cost of serving an additional token is collapsing, and providers are passing that cost structure change through to pricing as competitive pressure requires.
The implication for forecasting: Phase 3 is not over. The hardware generation cycle continues, and the providers with the largest scale advantages will continue extracting efficiency gains. The market structure—three major providers plus several well-funded entrants—ensures those gains translate into pricing pressure. Directionally, this curve points toward input pricing under $0.50/M for mid-tier models within 18 to 24 months from today.
What Cheaper Inference Actually Does to Product Economics
For product teams and CFOs managing AI spend, the math shift is significant—but the implications depend on which position you are starting from.
If you are a SaaS company embedding AI via API: The gross margin math that looked borderline in 2024 looks materially better in 2026. Consider a feature that runs a 2,000-token prompt and generates a 500-token response per user interaction. At o1 2024 pricing ($15/$60), that interaction cost $0.06 ($0.03 of input plus $0.03 of output). At o4-mini 2026 pricing ($1.10/$4.40), the same interaction costs $0.0044, a 93% reduction. Against $10/month of revenue per user, each interaction previously consumed 0.6% of that revenue and now consumes 0.044%. For a feature used five times per day, or roughly 150 times in a 30-day month, that is the difference between $9.00 of monthly inference cost per user (90% of revenue) and $0.66 (6.6%). The economic headroom to expand AI feature usage, lift token caps, and add capabilities that were previously uneconomical has opened significantly.
If you built your competitive moat on AI cost advantages: Cheap inference is a threat. If your pricing was partly justified by your ability to run AI at scale more efficiently than competitors, that moat erodes when frontier inference drops to commodity pricing. The analogy is companies that built competitive advantages in the 2000s by running their own data centers—when AWS commoditized cloud compute, that advantage disappeared. Teams in this position need to examine how much of their retention is driven by AI capability itself (durable, if the capability is genuinely differentiated) versus AI availability (less durable, as access becomes near-universal).
If you are buying AI SaaS products: The price war creates negotiation leverage. Enterprise contracts negotiated in 2024 may have included AI cost pass-throughs at inflated underlying model rates. As those contracts come up for renewal, the right posture is to audit what AI workloads you are actually paying for, compare the current underlying model prices against what is implied in your vendor's per-seat pricing, and negotiate explicitly for cost-of-delivery adjustments. Per-resolution pricing models—like those used by Intercom Fin, Zendesk AI, and Agentforce—are particularly sensitive to this dynamic: if the underlying model cost falls 90% but the per-resolution rate stays fixed, the entire reduction flows to the vendor's margin.
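For the API-embedding scenario, the per-interaction arithmetic is simple enough to keep in a small script and rerun as prices change. This is a minimal sketch: the function names are illustrative, and the 30-day month is an assumption used to turn daily usage into a monthly share of revenue.

```python
def interaction_cost(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """USD cost of one call, given list prices in USD per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def monthly_cogs_share(cost_per_interaction: float, interactions_per_day: float,
                       monthly_revenue: float, days_per_month: int = 30) -> float:
    """Fraction of per-user monthly revenue consumed by inference.

    days_per_month=30 is an assumption, not a figure from the article.
    """
    return cost_per_interaction * interactions_per_day * days_per_month / monthly_revenue


if __name__ == "__main__":
    # 2,000-token prompt and 500-token response, as in the scenario above.
    o1 = interaction_cost(2_000, 500, 15.00, 60.00)
    o4_mini = interaction_cost(2_000, 500, 1.10, 4.40)
    for label, cost in (("o1", o1), ("o4-mini", o4_mini)):
        share = monthly_cogs_share(cost, interactions_per_day=5, monthly_revenue=10.0)
        print(f"{label}: ${cost:.4f} per interaction, {share:.1%} of monthly revenue")
```

Swapping in your own token counts and seat price shows quickly whether a feature's usage cap is still justified by cost.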
The Caching and Batching Multiplier
List prices understate the real cost reductions available to sophisticated buyers, because two optimization techniques compound the headline discounts significantly.
Prompt caching stores the KV-cache from large, stable context windows—system prompts, knowledge base documents, long tool descriptions—and reuses them across inference calls without recomputing. Anthropic charges 90% less for cached input tokens. OpenAI charges 50% less. For applications with 20,000+ token system prompts running thousands of daily calls, caching alone can reduce effective per-call input costs by 60-75%.
Batch processing applies to workloads that do not require real-time responses: classification jobs, overnight analysis pipelines, document processing queues. All three major providers offer 50% discounts for asynchronous batch inference with 24-hour SLAs. For teams running nightly batch AI processes—enrichment, scoring, summarization—migrating those workloads to batch APIs is a straightforward 50% cost reduction with no capability trade-off.
Combining caching and batching for the right workloads can reduce effective inference costs to 10-20% of the headline rate. At Flash-Lite's $0.10 list price, a workload running 50,000-token cached prompts in batch mode would effectively cost $0.01-$0.02 per million input tokens. That is not a rounding error in the AI budget but a structural change in what is economically deployable.
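A rough way to estimate the combined effect is to blend the cached and uncached portions of the input and then apply the batch discount. The sketch below makes two simplifying assumptions that may not hold for every provider: the two discounts stack multiplicatively, and there is no surcharge for writing to the cache. The function name and parameters are illustrative.

```python
def effective_input_price(list_price_per_m: float, cached_fraction: float,
                          cache_discount: float, batch_discount: float = 0.0) -> float:
    """Blended USD price per million input tokens.

    cached_fraction: share of input tokens served from cache (0.0 to 1.0)
    cache_discount:  fractional discount on cached tokens (0.90 means 90% off)
    batch_discount:  fractional discount on the whole request; assumed to
                     stack multiplicatively with the cache discount
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    blended = list_price_per_m * (
        cached_fraction * (1.0 - cache_discount) + (1.0 - cached_fraction)
    )
    return blended * (1.0 - batch_discount)


if __name__ == "__main__":
    list_price = 3.00  # Sonnet-tier input list price from the table above
    cached_only = effective_input_price(list_price, cached_fraction=0.8, cache_discount=0.90)
    cached_and_batched = effective_input_price(list_price, 0.8, 0.90, batch_discount=0.50)
    print(f"cache only:    ${cached_only:.2f}/M ({cached_only / list_price:.0%} of list)")
    print(f"cache + batch: ${cached_and_batched:.2f}/M ({cached_and_batched / list_price:.0%} of list)")
```

With 80% of input tokens cached at a 90% discount, the blended rate is 28% of list price; adding a 50% batch discount brings it to 14%, inside the 10-20% range cited above.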
The Hidden Cost: Context Window and Tool-Call Overhead
The list-price table tells only part of the story. Infrastructure teams that have audited their AI spend against invoices consistently find three categories of costs not reflected in per-token rates.
Context window overhead. Models billed by token count charge for the entire context window on every call, including accumulated conversation history and tool results. Applications that naively append all conversation history to every call can face 3-5x the effective token cost compared to applications that truncate, summarize, or selectively include context. This is not a theoretical concern—it is the most common source of production cost overruns for teams moving from small pilots to high-volume deployments.
Tool-call and function-calling overhead. When models use tool calling to interact with external APIs, the tool definition tokens and tool call results are billed as part of the context window. A model that makes three tool calls per response, each with a 2,000-token tool definition schema, adds 6,000 tokens of overhead to every response. At $5/M tokens (Opus 4.8 input rate), that is $0.03 per interaction in tool overhead alone—before the actual task tokens.
Embedded provider features. Several providers now bundle value-added features that are billed separately or at premium rates: OpenAI's fine-tuning API, Anthropic's computer use capability, Google's grounding with search. Teams that benchmark raw token prices and then add premium features without recalculating often discover material cost surprises in production.
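The first two overhead categories are easy to quantify before they show up on an invoice. The sketch below is a simplified model with assumed message sizes: every call resends the system prompt, the earlier turns still held in context, and the new user message. All names and numbers are illustrative.

```python
from typing import Optional


def conversation_input_tokens(turns: int, system_tokens: int, user_tokens: int,
                              reply_tokens: int, window: Optional[int] = None) -> int:
    """Total input tokens billed across a conversation of `turns` calls.

    window caps how many earlier turns are resent with each call;
    None means the full history is appended every time.
    """
    total = 0
    for earlier_turns in range(turns):
        kept = earlier_turns if window is None else min(earlier_turns, window)
        total += system_tokens + kept * (user_tokens + reply_tokens) + user_tokens
    return total


def tool_overhead_cost(tool_calls: int, schema_tokens: int,
                       input_price_per_m: float) -> float:
    """USD cost of tool-definition tokens for one response."""
    return tool_calls * schema_tokens * input_price_per_m / 1_000_000


if __name__ == "__main__":
    naive = conversation_input_tokens(30, system_tokens=500, user_tokens=200, reply_tokens=300)
    windowed = conversation_input_tokens(30, 500, 200, 300, window=3)
    print(f"full history: {naive:,} tokens; last 3 turns only: {windowed:,} tokens")
    print(f"ratio: {naive / windowed:.1f}x")
    # Three tool calls with 2,000-token schemas at a $5/M input rate.
    print(f"tool overhead: ${tool_overhead_cost(3, 2_000, 5.00):.2f} per interaction")
```

Under these assumed sizes, a 30-turn conversation bills 238,500 input tokens with full history against 63,000 with a three-turn window, a gap of roughly 3.8x.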
The Vendor Selection Playbook for 2026
The multi-provider landscape has made vendor selection genuinely complex. Here is the framework Signal recommends for teams currently evaluating or re-evaluating their AI stack.
1. Tier your workloads by latency and quality requirements. Separate workloads into real-time user-facing (latency-sensitive, quality-critical), near-real-time background (tolerable latency, quality-important), and batch/async (flexible latency, quality-configurable). Each tier has different optimal economics. The cheapest model suitable for each tier is often not the cheapest model available overall.
2. Benchmark on your actual production prompts, not published benchmarks. MMLU, HumanEval, and GPQA measure academic task performance. Your prompts measure a different distribution. The model that tops the benchmarks may not be the model that maximizes quality per dollar on your specific task. Run each candidate model on 500-1,000 representative production examples and measure output quality per dollar at the actual token lengths your application generates.
3. Build for provider portability now. Prompt engineering done for one model's idiosyncrasies may not transfer cleanly to a different model family. Teams switching from GPT-4o to Claude Sonnet 4.6 or Gemini 2.5 Pro typically need to revalidate their system prompts. Building application layers that abstract the model interface—treating each provider as a swappable backend rather than a hardcoded dependency—creates optionality as the price curve continues moving.
4. Negotiate explicit price-reduction pass-through clauses in enterprise contracts. Model pricing is moving fast enough that multi-year contracts without price adjustment language are a one-sided bet against you. Standard enterprise contracts with OpenAI, Anthropic, and Google now include most-favored-nation pricing provisions that pass future list price reductions to enterprise customers. If your current contract does not have this language, request it at next renewal.
5. Monitor provider concentration risk. At current pricing, the economically optimal AI stack often involves a primary model and one or two fallback providers. Routing logic that degrades gracefully from a primary provider to a secondary on latency spikes, error rates, or cost thresholds is operationally mature practice. Anthropic's Fable 5 pricing transition illustrates how quickly provider economics can shift: even teams that have selected a primary provider need contingency routing for the scenarios where that provider changes pricing structure, access model, or capacity constraints.
The Competitive Dynamics for AI-Native SaaS
The price war's most interesting second-order effect is what it does to competitive moats in the application layer.
In 2024, one viable AI startup strategy was to compress margins in the short term, betting that the underlying model costs would fall before the unit economics became terminal. That bet has largely paid off for teams that made it: the underlying costs fell faster than many expected. But the same falling costs that vindicated the strategy now erode the competitive advantage of having survived the expensive period—because anyone starting today faces the same cheap-inference starting conditions.
The implication is that application-layer AI products need to compete on dimensions other than AI capability itself. The dimension most durable against commoditization is workflow integration: the organizational knowledge, data connections, and process embedding that make switching from your product expensive even if a competitor can match your AI features at a lower price. The Google Gemini API pricing structure and competitive alternatives show the same market reality: the floor for capable AI inference is now so low that the cost of building AI-powered features is no longer the primary barrier to competitive entry. The primary barrier is now product quality and distribution.
Teams that have spent the last 18 months building on the assumption that AI capability is the moat should revisit that assumption now. The commodity price of intelligence is $0.10 per million tokens and falling. The non-commodity elements of an AI product—data moats, integration depth, workflow fit, organizational trust—do not fall with it.
How to Approach the Next Quarterly AI Budget Review
The token price war is not a theoretical future event. The economics have already shifted. Teams that have not recalibrated their AI cost models since 2024 may be operating on assumptions that are 80-90% wrong in dollar terms.
Three concrete actions for the next quarter:
Audit your current AI spend vs. current list prices. Pull your model API invoices from the past six months. Map each line item to the current list price for the same model tier. The gap between what you are paying (potentially on older contracts or unoptimized integrations) and what is available today often represents a 40-60% cost reduction with no code changes, achievable through contract renegotiation or provider migration alone.
Implement caching for any application with stable large system prompts. If your system prompts exceed 10,000 tokens and you are running more than 1,000 daily calls, prompt caching is likely the highest-ROI infrastructure change available to you. Both OpenAI and Anthropic make implementation simple: place the stable, cacheable portion at the start of the prompt. OpenAI then applies caching automatically to repeated prefixes, while Anthropic requires marking the end of the cacheable prefix with a cache_control breakpoint in the API call.
Build a workload tiering map. Document each AI-powered feature or pipeline in your product and classify it by latency requirement (real-time, near-real-time, batch-suitable) and quality threshold (frontier model required, mid-tier sufficient, lightweight acceptable). This map is the input to an optimization pass that most teams have never formally done. With a 50x price spread between Flash-Lite ($0.10/M) and Opus ($5.00/M), assigning a workload to the wrong tier is an expensive mistake. According to independent tracking at pricepertoken.com, the spread between cheapest and most expensive frontier models has never been wider.
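A tiering map can start as a small data structure rather than a spreadsheet. The sketch below is one possible shape: the workload names and token volumes are hypothetical, the per-tier prices are input list prices taken from the table earlier in this piece, and the 50% batch discount is applied only to batch-suitable workloads.

```python
from dataclasses import dataclass
from enum import Enum


class Latency(Enum):
    REAL_TIME = "real-time"
    NEAR_REAL_TIME = "near-real-time"
    BATCH = "batch-suitable"


class Quality(Enum):
    FRONTIER = "frontier model required"
    MID_TIER = "mid-tier sufficient"
    LIGHTWEIGHT = "lightweight acceptable"


# Illustrative input list prices (USD per million tokens) for each quality tier.
TIER_INPUT_PRICE = {Quality.FRONTIER: 5.00, Quality.MID_TIER: 3.00, Quality.LIGHTWEIGHT: 0.10}
BATCH_DISCOUNT = 0.50


@dataclass(frozen=True)
class Workload:
    name: str
    latency: Latency
    quality: Quality
    monthly_input_tokens_m: float  # millions of input tokens per month


def monthly_input_cost(workload: Workload) -> float:
    """Estimated monthly input spend in USD for one workload."""
    price = TIER_INPUT_PRICE[workload.quality]
    if workload.latency is Latency.BATCH:
        price *= 1.0 - BATCH_DISCOUNT
    return price * workload.monthly_input_tokens_m


if __name__ == "__main__":
    # Hypothetical workloads for illustration.
    tier_map = [
        Workload("support copilot", Latency.REAL_TIME, Quality.FRONTIER, 100),
        Workload("ticket tagging", Latency.NEAR_REAL_TIME, Quality.MID_TIER, 100),
        Workload("nightly enrichment", Latency.BATCH, Quality.LIGHTWEIGHT, 100),
    ]
    for w in tier_map:
        print(f"{w.name:<20} {w.latency.value:<16} ${monthly_input_cost(w):,.2f}/month")
```

At equal volume, the same 100 million input tokens cost $500 on the frontier tier and $5 on the lightweight tier in batch mode, which is why the classification step deserves a formal review.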
Takeaway: The 90% collapse in AI inference pricing over 18 months is the fastest commodity re-pricing in enterprise software history. For buyers, it creates immediate audit opportunities and negotiation leverage. For builders, it removes cost barriers to broader AI feature deployment while simultaneously removing cost as a durable competitive moat. The teams that extract the most value from this moment are those that treat cheap inference not as a margin gift but as permission to invest aggressively in the dimensions—data, workflow integration, trust—that commodity pricing cannot touch.
Frequently Asked Questions
How much have AI inference prices dropped since 2024?
The price compression has been dramatic and consistent across providers. OpenAI's o1, the state-of-the-art reasoning model in late 2024, cost $15 per million input tokens and $60 per million output tokens. By mid-2026, its successor o3 costs $2 input and $8 output, an 87% reduction. The o4-mini model, which matches or exceeds o1 on most benchmarks, costs just $1.10 per million input tokens: a 93% price reduction for superior capability. Anthropic followed a parallel trajectory: Claude Opus 4.1 launched at $15 input and $75 output, and by the February 2026 Opus 4.6 release that had been cut 67% to $5 and $25. Google's Gemini 2.5 Flash costs just $0.30 and $2.50 per million tokens, and Flash-Lite runs at $0.10 and $0.40. The speed of this compression, from $15 to $1.10 on input in roughly 18 months, significantly outpaces historical analogies like cloud compute or storage, which took years to cover similar percentages.
What does the AI token price war mean for SaaS companies building on AI APIs?
For SaaS products that embed AI via API, cheap inference cuts both ways. On the upside, the gross margin math that looked untenable at 2024 pricing becomes viable in 2026: a feature that cost $0.02 per user interaction at $15/M tokens now costs $0.0013 at $1/M, making previously uneconomical feature sets profitable at scale. On the downside, if your competitors face the same cost curve, cheaper inference does not produce durable margin advantages—it accelerates the pressure on feature differentiation. The companies that benefit most are those who use the cost reduction to expand usage volume rather than simply reduce COGS. Products that previously rate-limited AI features for cost reasons can now lift those caps and study which incremental usage drives retention. The teams that get hurt are those whose pricing was built on AI as a cost moat—cheaper inference removes the moat and exposes the product.
Should companies switch AI providers based on token price alone?
Token price is one input into provider selection, not the primary one. The right selection framework weighs five dimensions: capability fit (which model actually solves your use case reliably), latency (output tokens per second under load), context window economics (some tasks require 200K+ context, which changes the price math significantly), reliability and uptime SLAs (critical for production workloads), and ecosystem lock-in risk (how easy is it to swap if the model degrades or pricing changes). On pure price, Gemini 2.5 Flash-Lite at $0.10/$0.40 is the cheapest frontier option today, but that does not mean it is the right answer for reasoning-heavy tasks. A sensible approach is to tier your workloads: use cheaper, faster models for high-volume, lower-stakes inference and reserve expensive models for high-stakes generation. Most teams running single-model stacks are overpaying on some workloads and underperforming on others.
What is prompt caching and how much can it reduce AI inference costs?
Prompt caching is a technique where the provider stores the KV cache from a long system prompt or context prefix and reuses it across multiple inference calls, rather than recomputing the attention layers on every request. For applications with a large, stable system prompt (a detailed tool description, a product knowledge base, or a long document the model references repeatedly), caching can reduce the cost of those cached tokens by 50 to 90 percent, depending on the provider. OpenAI charges 50% less for cached input tokens; Anthropic charges 90% less. Google Gemini offers similar discounts with its context caching feature. In practice, an application with a 50,000-token system prompt making 10,000 daily API calls could see effective per-call costs drop by 60-70% once the prompt cache is warm. Batch processing provides a complementary saving: all major providers now offer 50% discounts for asynchronous batch inference workloads, with 24-hour SLAs. Combining caching with batching for appropriate workloads can reduce effective inference costs to roughly 10-20% of the headline token rate.
How does the inference price war affect enterprise AI procurement in 2026?
Enterprise AI procurement has shifted from "what can we afford to run?" to "what do we need these models to do, and what is the optimal tier for each workload?" In 2024, most enterprise conversations were about whether to build AI features at all, given uncertain cost trajectories. In 2026, the cost conversation has moved one level of abstraction up: teams now negotiate around commitment discounts (volume commitments in exchange for 30-40% discounts), prompt caching strategies, batch processing tier eligibility, and multi-model routing architectures that balance cost and quality dynamically. The vendor selection calculus has also changed: providers are competing not just on model capability but on cost optimization tooling, with prompt caching dashboards, batch API performance guarantees, and cost monitoring APIs becoming standard procurement criteria. Enterprise contracts now typically include most-favored-nation clauses for future price reductions, reflecting how quickly the landscape changes.